Root Cause Analysis
Incident: RiskOS API Services Degradation During NAT Gateway Remediation
Date Range: 06-16-2026
Impact: Intermittent degradation for a subset of API requests
Status: Recovered
On June 16th, 2026, RiskOS API services experienced intermittent degradation for a subset of API requests between 3:46 AM EDT and 4:02 AM EDT. During this period, some requests failed while network connectivity was being transitioned as part of a controlled remediation.
Earlier that day, Socure observed elevated connection error alerts and identified critically high Cloud NAT port utilization in the RiskOS commercial environment. The condition was not impacting existing customer transactions at the time of detection, but it created a significant capacity risk for outbound connectivity if additional workloads restarted, scaled, or deployed.
To reduce the risk of broader impact, Socure performed a controlled network remediation during a lower-workload window. The remediation improved capacity handling and reduced shared NAT resource contention across origin IP blocks. During the transition, some existing connections were briefly interrupted and had to be re-established, resulting in intermittent request failures.
The remediation activity completed in approximately 5 minutes, and overall recovery and stabilization completed within approximately 15 minutes. No customer action was required after connectivity was restored.
| June 16th, 2026 / Time | Event |
|---|---|
| 12:00 AM EDT | The Socure Engineering team observed elevated connection error alerts and began an investigation. No customer transaction impact was identified at this time. |
| 12:30 AM EDT | Investigation identified critically high Cloud NAT port utilization, creating a capacity risk for outbound connectivity. SRE confirmed that existing customer transactions were not being impacted at that time. |
| 12:30–3:45 AM EDT | The Socure Engineering team evaluated remediation options, assessed the risk of leaving the configuration unchanged, and planned remediation during a lower-workload window to reduce potential customer impact. |
| 3:45 AM EDT | The engineering team proceeded with the planned network capacity remediation. |
| 3:46 AM EDT | Intermittent request failures began as some existing network connections were briefly interrupted during the transition to the updated configuration. |
| 3:52–4:00 AM EDT | Network configuration updates were applied to improve capacity handling and reduce shared NAT resource contention across origin IP blocks. |
| 4:02 AM EDT | Intermittent request failures ended after traffic completed transition to the updated configuration and connections were re-established. |
| ~4:02 AM EDT | Platform connectivity stabilized. |
| ~4:15 AM EDT | Incident considered resolved after validation and monitoring confirmed recovery. |
Primary Root Cause: The incident was caused by critically high Cloud NAT port utilization in the RiskOS commercial environment. The NAT configuration did not have enough available port capacity to provide safe headroom for normal connection churn, workload restarts, scaling activity, or deployments.
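To illustrate what "port utilization" and "headroom" mean here, the following is a minimal sketch of the arithmetic, not a description of Socure's actual configuration. It assumes 64,512 usable source ports per NAT IP address per protocol (65,536 minus the 1,024 well-known ports); the function names and the example figures are hypothetical.

```python
# Illustrative arithmetic only; names and figures are assumptions,
# not values taken from the RiskOS environment.

# Assumed usable source ports per NAT IP address, per protocol
# (65,536 total minus the 1,024 well-known ports).
PORTS_PER_NAT_IP = 64_512


def nat_port_capacity(nat_ip_count: int) -> int:
    """Total source ports available across all NAT IP addresses."""
    if nat_ip_count <= 0:
        raise ValueError("nat_ip_count must be positive")
    return nat_ip_count * PORTS_PER_NAT_IP


def nat_port_utilization(ports_in_use: int, nat_ip_count: int) -> float:
    """Fraction of NAT source ports currently allocated (0.0 to 1.0)."""
    return ports_in_use / nat_port_capacity(nat_ip_count)


def remaining_headroom(ports_in_use: int, nat_ip_count: int) -> int:
    """Ports still free for restarts, scaling, or deployments."""
    return nat_port_capacity(nat_ip_count) - ports_in_use


if __name__ == "__main__":
    # Hypothetical example: 2 NAT IPs with 120,000 ports allocated.
    print(f"utilization: {nat_port_utilization(120_000, 2):.1%}")
    print(f"headroom: {remaining_headroom(120_000, 2)} ports")
```

When utilization approaches 100%, new outbound connections from restarting or scaling workloads may fail to obtain a port even though existing connections continue to work, which matches the risk described above.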
Contributing Factors:
Socure updated the NAT configuration to improve capacity handling and reduce shared-resource contention across origin IP blocks. The remediation was performed because the existing NAT utilization level presented a significant reliability risk, and delaying the change could have increased the chance of broader service degradation.
During the transition, some existing connections were reset and had to be re-established. Once the updated configuration took effect, intermittent request failures stopped and platform connectivity stabilized.
No customer action was required after connectivity was restored. Requests retried after recovery were expected to process normally.
| Action | Description | ETA / Status |
|---|---|---|
| Reduce NAT capacity contention | NAT usage was separated to reduce shared capacity risk and limit the blast radius of future NAT capacity issues. | Completed |
| Enable dynamic port allocation | Dynamic port allocation was enabled to improve NAT port capacity management and reduce exhaustion risk. | Completed |
| Improve NAT utilization monitoring | Monitoring for NAT port utilization, dropped packets, and connection errors was added or tuned, with actionable alert thresholds that trigger before utilization reaches critical levels. | Completed |
| Move network configuration to Infrastructure as Code | Continue moving RiskOS network configuration into version-controlled Infrastructure as Code to improve reviewability, repeatability, and recovery as part of the AWS migration. | Planned |
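
As a rough illustration of alerting before utilization reaches critical levels, the sketch below classifies a utilization reading against two thresholds. The threshold values (70% and 85%) and the function name are hypothetical assumptions chosen for the example; they are not Socure's actual alert settings.

```python
# Hypothetical thresholds for illustration; real values would be tuned
# to the environment's connection churn and scaling behavior.
WARNING_THRESHOLD = 0.70
CRITICAL_THRESHOLD = 0.85


def classify_nat_utilization(
    utilization: float,
    warning: float = WARNING_THRESHOLD,
    critical: float = CRITICAL_THRESHOLD,
) -> str:
    """Map a NAT port utilization fraction (0.0 to 1.0) to an alert level."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    if warning >= critical:
        raise ValueError("warning threshold must be below critical threshold")
    if utilization >= critical:
        return "critical"
    if utilization >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for reading in (0.50, 0.75, 0.93):
        print(f"{reading:.0%} -> {classify_nat_utilization(reading)}")
```

A warning level below the critical level is intended to leave time to add capacity during a planned window rather than under pressure.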
Socure recognizes that customers rely on RiskOS to be available and reliable. We take accountability for the disruption caused during this remediation and are continuing to improve our network architecture and operational controls.
To reduce the likelihood of recurrence, Socure is pursuing the corrective actions listed in the table above.