RiskOS Service Outage

Incident Report for Socure

Postmortem

Root Cause Analysis

Incident: RiskOS Network Connectivity Loss

Date Range: 06-09-2026
Impact: Sev1
Status: Recovered

1. Summary

On June 9, 2026, between 17:59 and 20:00 ET, network connectivity for RiskOS was disrupted in our commercial environment, resulting in widespread 5xx errors, limited webhook connection problems, and platform unavailability for multiple customers. The issue was traced to a recent GCP network change, which was reverted once identified. Because network connectivity was impacted, API calls made during the outage never reached Socure and could not be recovered afterward. Customer retries made after connectivity was restored worked as intended. API functionality began recovering around 18:41 ET, dashboard recovery began at 19:10 ET, and full resolution was reached at 20:00 ET.
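
Because requests sent during the outage never reached Socure, recovery depended on client-side retries. The following is a minimal, illustrative sketch of a retry loop with exponential backoff and jitter. It is not part of any Socure SDK; `TransientError`, `call_with_retries`, and all parameter values are hypothetical names and defaults chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a 5xx response or a connection failure."""


def call_with_retries(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last failure to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Here `send` stands for any zero-argument callable that performs the request and raises `TransientError` on a retryable failure. Retrying is only safe when the request is idempotent or carries an idempotency key.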

2. Timeline

3. Root Cause

  • Primary Root Cause: Unintentional deletion of a Cloud Router in the production Google Cloud Platform account caused the deletion of the Cloud NAT resources for the RiskOS Sandbox and Production networks, severing outbound connectivity.
  • Contributing Factors:

    • Google Cloud configurations are currently managed manually, which increases the likelihood of accidental errors.
    • In our existing configuration, the central networking resources are defined together with production and sandbox resources, which increases the blast radius of core network modifications. In this case, the change unintentionally impacted production resources.
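
The blast-radius point can be illustrated with a small dependency walk. This is a sketch under assumed names only: the resource identifiers and the `DEPENDENTS` map are hypothetical and do not describe Socure's actual topology. It shows how deleting one shared resource reaches everything that depends on it, directly or indirectly.

```python
from collections import deque

# Hypothetical dependency map: each key lists the resources that depend on it.
DEPENDENTS = {
    "cloud-router": ["cloud-nat-sandbox", "cloud-nat-production"],
    "cloud-nat-sandbox": ["sandbox-egress"],
    "cloud-nat-production": ["production-egress"],
}


def blast_radius(dependents, deleted):
    """Return every resource transitively affected by deleting `deleted`."""
    affected = []
    seen = {deleted}
    queue = deque([deleted])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


print(blast_radius(DEPENDENTS, "cloud-router"))
```

With a single shared router, both the sandbox and production egress paths appear in the result. Giving each environment its own router would limit the result of any one deletion to that environment.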

4. Resolution

Socure restored the deleted Cloud NAT resources and re-attached the known elastic IP addresses to network egress so that no customer action was required. Additionally, select services (the backends for our dashboard case listing and search) did not tolerate the network interruption and required a manual restart.

5. Corrective and Preventive Actions

6. Lessons Learned

  • The Cloud NATs that were deleted were unrelated to the scope of the network changes planned for the maintenance. Separating the production configuration and applying our standard high-scrutiny change process to it is the highest priority.
  • Configure resource protection for critical infrastructure components to prevent unintentional changes.
  • Conduct network operations outside US business hours so that more support resources are available in the event of service impact.
  • Leverage infrastructure as code for absolute configuration and resource repeatability, and to facilitate human review of dangerous operations.
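
As one illustration of how resource protection and human review can be combined, a pre-apply check can refuse any change set that deletes a protected resource. This is a hedged sketch, not Socure's tooling: the change-set structure, `PROTECTED_TYPES`, and `find_blocked_changes` are hypothetical, and the resource type strings are used only as labels.

```python
# Hypothetical set of resource types that must never be deleted automatically.
PROTECTED_TYPES = {"google_compute_router", "google_compute_router_nat"}


def find_blocked_changes(planned_changes, protected_types=PROTECTED_TYPES):
    """Return addresses of protected resources that the plan would delete."""
    blocked = []
    for change in planned_changes:
        if change["type"] in protected_types and "delete" in change["actions"]:
            blocked.append(change["address"])
    return blocked


# Hypothetical change set for illustration.
plan = [
    {"address": "router.core", "type": "google_compute_router",
     "actions": ["delete"]},
    {"address": "firewall.allow_ssh", "type": "google_compute_firewall",
     "actions": ["update"]},
]

blocked = find_blocked_changes(plan)
if blocked:
    print("Manual review required before applying:", blocked)
```

A check like this would run before any change is applied, so a deletion outside the planned scope is stopped and escalated for review rather than executed.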

7. Next Steps & Ongoing Commitment

Socure knows customers rely on it to be available and accurate, and takes full accountability for this configuration error and the disruption it caused. To ensure this problem does not recur, we are pursuing the following steps:

  1. Resource Protection: We are adding deletion and change-protection controls to critical network infrastructure so that an accidental change cannot disrupt the platform.
  2. Infrastructure as Code: We are already moving RiskOS network configuration to a version-controlled model so that every change is reviewed, repeatable, and quickly recoverable within Socure’s AWS environment.
  3. Environment Isolation: We are separating workload and core platform network resources so that the impact of any single change is contained.
  4. Safer Maintenance Windows: We are scheduling network maintenance during periods with the strongest engineering and support coverage.
Posted Jun 12, 2026 - 17:10 EDT

Resolved

The issue impacting RiskOS has been resolved, and service has returned to normal operation. Customers should no longer experience errors, degraded performance, or delays when accessing RiskOS services and workflows.

Our teams have confirmed system stability and will continue to monitor performance closely.

We appreciate your patience throughout this incident and apologize for the impact this outage may have had on your operations.
Posted Jun 09, 2026 - 20:11 EDT

Identified

RiskOS API and real-time processing services have been fully operational for all customers since 07:17 PM ET.

- Dashboard functionality, including case listing and search, has been restored for customers. Some customers may experience intermittent delays in seeing new case updates or decision changes reflected in the dashboard.
- Webhook deliveries remain impacted for a small subset of customers, and our team is actively working toward a full resolution.

We will continue to monitor system health and provide additional updates as we confirm complete service restoration.

Thank you for your patience and understanding.
Posted Jun 09, 2026 - 19:52 EDT

Investigating

We are currently investigating a system-wide issue impacting RiskOS. Customers may experience errors, degraded performance, or delays when accessing RiskOS services and workflows.

Our engineering teams are actively working to identify the root cause and restore full functionality as quickly as possible.
We will provide additional updates as more information becomes available.
Posted Jun 09, 2026 - 18:24 EDT
This incident affected: RiskOS Platform (Sigma Identity Fraud, Sigma Synthetic Fraud, Predictive DocV, KYC, Email RiskScore, Address RiskScore, Account Intelligence, Phone RiskScore, Sigma Device, Alert List, Decision, eCBSV, Negative Positive List, Graph Intelligence, Sigma First-Party Fraud, Verify (eKYC), Prefill, CNAM, OTP, Silent Network Authentication (SNA)) and RiskOS.