RETROACTIVE: RiskOS API Degradation

Incident Report for Socure

Postmortem

Root Cause Analysis

Incident: RiskOS API Services Degradation During NAT Gateway Remediation

Date Range: 06-16-2026

Impact: Intermittent degradation for a subset of API requests

Status: Recovered

1. Summary

On June 16th, 2026, RiskOS API services experienced intermittent degradation for a subset of API requests between 3:46 AM EDT and 4:02 AM EDT. During this period, some requests failed while network connectivity was being transitioned as part of a controlled remediation.

Earlier that day, Socure observed elevated connection error alerts and identified critically high Cloud NAT port utilization in the RiskOS commercial environment. The condition was not impacting existing customer transactions at the time of detection, but it created a significant capacity risk for outbound connectivity if additional workloads restarted, scaled, or deployed.

To reduce the risk of broader impact, Socure performed a controlled network remediation during a lower-workload window. The remediation improved capacity handling and reduced shared NAT resource contention across origin IP blocks. During the transition, some existing connections were briefly interrupted and had to be re-established, resulting in intermittent request failures.

The remediation activity completed in approximately 5 minutes, and overall recovery and stabilization completed within approximately 15 minutes. No customer action was required after connectivity was restored.

2. Timeline

Time (June 16th, 2026) | Event
12:00 AM EDT | The Socure Engineering team observed elevated connection error alerts and began an investigation. No customer transaction impact was identified at this time.
12:30 AM EDT | Investigation identified critically high Cloud NAT port utilization, creating a capacity risk for outbound connectivity. SRE confirmed that existing customer transactions were not being impacted at that time.
12:30–3:45 AM EDT | The Socure Engineering team evaluated remediation options, assessed the risk of leaving the configuration unchanged, and planned remediation during a lower-workload window to reduce potential customer impact.
3:45 AM EDT | The engineering team proceeded with the planned network capacity remediation.
3:46 AM EDT | Intermittent request failures began as some existing network connections were briefly interrupted during the transition to the updated configuration.
3:52–4:00 AM EDT | Network configuration updates were applied to improve capacity handling and reduce shared NAT resource contention across origin IP blocks.
4:02 AM EDT | Intermittent request failures ended after traffic completed transition to the updated configuration and connections were re-established.
~4:02 AM EDT | Platform connectivity stabilized.
~4:15 AM EDT | Incident considered resolved after validation and monitoring confirmed recovery.

3. Root Cause

Primary Root Cause: The incident was caused by critically high Cloud NAT port utilization in the RiskOS commercial environment. The NAT configuration had insufficient available port capacity to provide safe headroom for normal connection churn, workload restarts, scaling activity, or deployments.

Contributing Factors:

  • The residual NAT capacity risk was a consequence of emergency recovery activities undertaken after the June 9 RiskOS service outage, where the primary focus was restoring service availability.
  • NAT port utilization reached approximately 99–100%, creating a significant risk of outbound connectivity degradation.
  • NAT capacity was shared across origin IP blocks, which increased the likelihood of port exhaustion under workload pressure.
  • During remediation, some existing connections were reset while traffic transitioned to the updated network configuration, resulting in temporary intermittent request failures.
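
To illustrate why utilization near 99–100% leaves no safe headroom, the following is a minimal sketch of the port arithmetic. It assumes that each NAT IP address offers 64,512 usable source ports per protocol (65,536 minus the 1,024 well-known ports). The pool size, ports in use, and restart demand are hypothetical figures chosen for illustration, not measurements from the RiskOS environment.

```python
from dataclasses import dataclass

# Assumption: each NAT IP address offers 64,512 usable source ports per
# protocol (65,536 minus the 1,024 well-known ports).
PORTS_PER_NAT_IP = 64_512


@dataclass(frozen=True)
class NatPool:
    """A hypothetical NAT gateway pool; all figures are illustrative."""
    name: str
    nat_ips: int
    ports_in_use: int

    @property
    def capacity(self) -> int:
        # Total source ports the pool can hand out.
        return self.nat_ips * PORTS_PER_NAT_IP

    @property
    def utilization(self) -> float:
        # Fraction of the pool's ports currently allocated.
        return self.ports_in_use / self.capacity

    @property
    def headroom(self) -> int:
        # Ports still available for new outbound connections.
        return self.capacity - self.ports_in_use


def can_absorb(pool: NatPool, extra_ports: int) -> bool:
    """Return True if the pool has enough free ports for extra_ports."""
    return pool.headroom >= extra_ports


# One pool shared by every origin IP block, running near exhaustion.
shared = NatPool("shared", nat_ips=2, ports_in_use=128_000)

# Hypothetical rolling restart: 40 workloads, 64 outbound connections each.
restart_demand = 40 * 64

print(f"{shared.utilization:.1%} used, {shared.headroom} ports free")
print("restart fits:", can_absorb(shared, restart_demand))
```

In this sketch, existing connections keep working because their ports are already allocated, but a modest restart cannot obtain the ports it needs. That matches the kind of risk described above: no impact at detection time, with significant exposure if workloads restarted, scaled, or deployed.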

4. Resolution

Socure updated the NAT configuration to improve capacity handling and reduce shared-resource contention across origin IP blocks. The remediation was performed because the existing NAT utilization level presented a significant reliability risk, and delaying the change could have increased the chance of broader service degradation.

During the transition, some existing connections were reset and had to be re-established. Once the updated configuration took effect, intermittent request failures stopped and platform connectivity stabilized.

No customer action was required after connectivity was restored. Requests retried after recovery were expected to process normally.
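
For integrators who want to tolerate brief connection resets of this kind, one common client-side pattern is retrying with exponential backoff and jitter. The sketch below is a generic illustration, not a documented RiskOS client behavior: `call_with_retry`, the `send` callable, and every default value are hypothetical, and retrying is only safe for requests that are idempotent.

```python
import random
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-5XX status or attempts run out.

    send is a hypothetical callable that performs one request and returns
    the HTTP status code. A return value of 0 means no response was
    received on the final attempt (for example, the connection was reset).
    """
    status = 0
    for attempt in range(max_attempts):
        try:
            status = send()
        except ConnectionError:
            # A reset or refused connection is treated as retryable.
            status = 0
        if 0 < status < 500:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    return status
```

With the defaults shown, the delay ceilings across the four retries sum to 7.5 seconds, so this budget alone would not outlast a disruption of several minutes. A caller would tune the attempt count and delay cap to its own tolerance.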

5. Corrective and Preventive Actions

Action | Description | ETA / Status
Reduce NAT capacity contention | NAT usage was separated to reduce shared capacity risk and limit the blast radius of future NAT capacity issues. | Completed
Enable dynamic port allocation | Dynamic port allocation was enabled to improve NAT port capacity management and reduce exhaustion risk. | Completed
Improve NAT utilization monitoring | Add or tune monitoring for NAT port utilization, dropped packets, and connection errors with actionable alert thresholds before utilization reaches critical levels. | Completed
Move network configuration to Infrastructure as Code | Continue moving RiskOS network configuration into version-controlled Infrastructure as Code to improve reviewability, repeatability, and recovery as part of the AWS migration. | Planned
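
As an illustration of what "actionable alert thresholds before utilization reaches critical levels" could look like, here is a minimal sketch of tiered alert classification. The function name, the 70% and 85% thresholds, and the rule that any dropped packet is critical are assumptions made for this example and do not reflect Socure's actual monitoring configuration.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify_nat_health(
    utilization: float,
    dropped_packets: int,
    warn_at: float = 0.70,      # hypothetical warning threshold
    critical_at: float = 0.85,  # hypothetical critical threshold
) -> Severity:
    """Map NAT port utilization and packet drops to an alert severity.

    utilization is the fraction of NAT ports in use (0.0 to 1.0).
    dropped_packets is the count of packets dropped for lack of ports
    during the evaluation window.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    # Drops mean capacity is already exhausted for some connections.
    if dropped_packets > 0 or utilization >= critical_at:
        return Severity.CRITICAL
    if utilization >= warn_at:
        return Severity.WARNING
    return Severity.OK


for sample in (0.40, 0.75, 0.99):
    print(sample, classify_nat_health(sample, dropped_packets=0).value)
```

The point of the lower tier is lead time: a warning well below saturation allows remediation to be scheduled rather than performed under pressure.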

6. Lessons Learned

  • NAT port utilization should be treated as a critical reliability signal for workloads that depend on outbound connectivity.
  • NAT capacity should have sufficient headroom for normal connection churn, workload restarts, scaling events, and deployments.
  • Remediation of high-risk network conditions can still create temporary impact when existing connections are reset, so change planning should include clear validation and communication steps.
  • Network changes should include pre-change checks for current utilization, available NAT capacity, expected connection reset behavior, and rollback options.
  • Continued investment in environment isolation, Infrastructure as Code, and stronger monitoring will reduce the likelihood and impact of similar incidents.
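
The pre-change checks listed above can be expressed as a simple gate that blocks a change until each item is recorded. This is a minimal sketch under stated assumptions: `NetworkChangePlan`, its fields, and `pre_change_blockers` are hypothetical names for illustration and do not describe Socure's change-management tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NetworkChangePlan:
    """Hypothetical record of the checks performed before a network change."""
    current_utilization: Optional[float] = None  # measured fraction, 0.0-1.0
    free_nat_ports: Optional[int] = None         # measured available capacity
    resets_existing_connections: bool = True     # expected reset behavior
    customer_notice_prepared: bool = False       # communication step
    rollback_steps: List[str] = field(default_factory=list)


def pre_change_blockers(plan: NetworkChangePlan) -> List[str]:
    """Return the reasons the change should not proceed (empty if none)."""
    blockers = []
    if plan.current_utilization is None:
        blockers.append("current NAT utilization not measured")
    if plan.free_nat_ports is None:
        blockers.append("available NAT capacity not measured")
    if plan.resets_existing_connections and not plan.customer_notice_prepared:
        blockers.append("connection resets expected but no communication prepared")
    if not plan.rollback_steps:
        blockers.append("no rollback steps documented")
    return blockers


plan = NetworkChangePlan(current_utilization=0.99, free_nat_ports=1_024)
for reason in pre_change_blockers(plan):
    print("BLOCKED:", reason)
```

Returning a list of reasons rather than a single boolean keeps the gate actionable, since the operator sees every missing item at once.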

7. Next Steps & Ongoing Commitment

Socure recognizes that customers rely on RiskOS to be available and reliable. We take accountability for the disruption caused during this remediation and are continuing to improve our network architecture and operational controls.

To reduce the likelihood of recurrence, Socure is pursuing the following:

  1. NAT Capacity Controls: We are improving monitoring and alerting for NAT utilization, connection drops, and port exhaustion risk so that capacity issues can be identified and remediated before they impact customer traffic.
  2. Infrastructure as Code: We are continuing to move RiskOS network configuration to a version-controlled model so that every change is reviewed, repeatable, and can be recovered quickly.
  3. Environment Isolation: We are separating network resources and reducing shared capacity dependencies so that the impact of any capacity issue or network change is contained.
  4. Safer Network Operations: We are strengthening pre-change validation, rollback planning, and maintenance window selection for network changes that may affect customer traffic.
  5. Operational Runbooks: We are updating operational runbooks for NAT capacity exhaustion and Cloud NAT changes so that response and recovery steps are repeatable and clearly documented.
Posted Jun 19, 2026 - 12:14 EDT

Resolved

On June 16, 2026, from 03:45 AM to 04:03 AM EDT, RiskOS APIs experienced elevated error rates. The issue is fully resolved and all systems are now operating normally.

The degradation occurred during remediation work tied to a networking error alert. A subset of API requests returned 5XX errors during this window, though the service remained partially available throughout.

No customer action is required. A full Root Cause Analysis (RCA) will be shared once the investigation concludes.
Posted Jun 16, 2026 - 03:30 EDT