Root Cause Analysis
Incident: RiskOS Network Connectivity Loss
Date Range: 06-09-2026
Impact: Sev1
Status: Recovered
1. Summary
On June 9th, 2026, between 17:59 EST and 20:00 EST, network connectivity for RiskOS was disrupted in our commercial environment, resulting in widespread 5xx errors, limited webhook connection problems, and platform unavailability for multiple customers. The issue was traced to a recent GCP network change, which was reverted after discovery. Because network connectivity was impacted, API calls made during the incident never reached Socure and could not be recovered afterward. Customer retries made after connectivity was restored worked as intended. API functionality began recovering around 18:41 EST, dashboard recovery began at 19:10 EST, and full resolution was reached at 20:00 EST.
2. Timeline
3. Root Cause
- Primary Root Cause: The unintentional deletion of a Cloud Router in the production Google Cloud Platform account also deleted the Cloud NAT resources for the RiskOS Sandbox and Production networks, severing outbound connectivity.
Contributing Factors:
- Google Cloud configurations are currently managed manually, which contributes to a higher rate of accidental errors.
- In our existing configuration, the central networking resources are defined together with production and sandbox resources, which increases the blast radius of core network modifications. In this case, the change unintentionally impacted production resources.
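The cascade described above can be illustrated with a minimal Python sketch. The resource names and the dependency map are hypothetical and do not reflect Socure's actual configuration; the only fact taken from this report is that deleting the Cloud Router also removed the Cloud NAT resources that depended on it.

```python
from collections import deque

# Hypothetical dependency map: each key lists the resources that are
# lost when the key is deleted. All names are illustrative only.
DEPENDENTS = {
    "cloud-router": ["nat-sandbox", "nat-production"],
    "nat-sandbox": ["sandbox-egress"],
    "nat-production": ["production-egress"],
}


def blast_radius(resource, dependents=DEPENDENTS):
    """Return every resource affected by deleting `resource`, in breadth-first order."""
    affected = []
    seen = {resource}
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected
```

In this toy model, a single shared router reaches both the sandbox and production egress paths, which is the kind of coupling that environment isolation is meant to remove.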
4. Resolution
Socure restored the deleted Cloud NAT resources and re-attached the known elastic IP addresses to network egress so that no customer action was required. Additionally, select services (the backends to our dashboard case listing and search) were intolerant of the network interruption and required a manual restart.
5. Corrective and Preventive Actions
6. Lessons Learned
- The Cloud NATs that were deleted were unrelated to the scope of the network changes planned for the maintenance. Separating the production configuration and applying our standard high-scrutiny change process to it is the highest priority.
- Configure resource protection for critical infrastructure components to prevent unintentional changes.
- Conduct network operations outside US business hours, when more support resources are available in the event of service impact.
- Leverage infrastructure as code for authoritative configuration, to ensure resource repeatability, and to facilitate human review of dangerous operations.
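As one way to picture how resource protection and human review could work together, the following Python sketch checks a proposed change plan and refuses to proceed if it deletes a protected resource without explicit approval. The `Change` type, the `PROTECTED` set, and the function names are all hypothetical and are not a description of Socure's tooling or of any cloud provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    action: str    # "create", "update", or "delete"
    resource: str  # hypothetical resource identifier


# Hypothetical set of critical resources that must not be deleted silently.
PROTECTED = {"cloud-router", "nat-sandbox", "nat-production"}


def blocked_changes(plan, protected=PROTECTED):
    """Return the changes in `plan` that delete a protected resource."""
    return [c for c in plan if c.action == "delete" and c.resource in protected]


def check_plan(plan, approved=False):
    """Raise unless every protected deletion in `plan` has been explicitly approved."""
    blocked = blocked_changes(plan)
    if blocked and not approved:
        names = ", ".join(c.resource for c in blocked)
        raise PermissionError(f"deletion of protected resources requires approval: {names}")
    return True
```

A plan that only updates unprotected resources passes unchanged, while a plan containing `Change("delete", "cloud-router")` is stopped until a reviewer sets `approved=True`.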
7. Next Steps & Ongoing Commitment
Socure knows customers rely on it to be available and accurate, and takes full accountability for this configuration error and the disruption it caused. To ensure this problem does not recur, we are pursuing the following steps:
- Resource Protection: We are adding deletion and change-protection controls to critical network infrastructure so that an accidental change cannot disrupt the platform.
- Infrastructure as Code: We are already moving RiskOS network configuration to a version-controlled model so that every change is reviewed, repeatable, and can be recovered quickly within Socure’s AWS environment.
- Environment Isolation: We are separating workload and core platform network resources so that the impact of any single change is contained.
- Safer Maintenance Windows: We are scheduling network maintenance during periods with the strongest engineering and support coverage.