RETROACTIVE: [SEV1] RiskOS - Temporary Spike in HTTP 4xx and 5xx Responses

Incident Report for Socure

Postmortem

Root Cause Analysis
Date: March 10, 2026

Summary

On March 10, 2026, between 4:53 PM and 5:03 PM ET, RiskOS experienced a brief service degradation impacting the Evaluation APIs.

During a period of concentrated peak request volume, a high-traffic database table reached a performance scaling threshold. Over the past six months, organic growth in the metadata table increased the computational effort required to process certain requests. This growth, combined with a sudden burst in concurrency, led to temporary database CPU saturation.

The incident followed an automated recovery path: the system cleared the request backlog once peak concurrency levels normalized. Following the recovery, Socure Engineering implemented architectural hardening measures in the data access layer, achieving a 99% reduction in latency for the query pattern that contributed to the incident and ensuring the platform can accommodate future growth without recurrence.

Timeline

All times Eastern (ET).

  • 04:53 PM: Elevated latency and degraded availability began affecting the RiskOS Evaluation APIs.
  • 04:54 PM: Automated monitoring triggered alerts for increased 4xx/5xx response codes. The on-call engineering team was paged and began investigating.
  • 04:55 PM: A second alert for increased API latency was triggered.
  • 04:59 PM: A database monitoring alert was triggered, indicating a sudden spike in CPU utilization on the datastore.
  • 05:00 PM – 05:03 PM: The CPU alert helped the team isolate the problematic query pattern within the Evaluation PATCH endpoint. As the burst of requests causing the degradation subsided, API latency and error rates began returning to normal.
  • 05:03 PM: System performance fully stabilized. The Evaluation APIs resumed normal operation without further elevated error rates or latency. The incident was closed, and a post-incident review was initiated.

Impact

During the ~10-minute window:

  • Elevated latency for Evaluation API requests.
  • Increased 4xx and 5xx error responses.
  • Intermittent request failures as the system prioritized resource stability.

Approximately two-thirds of evaluation requests completed successfully during the incident window. The remainder returned client/server errors or timed out due to the database reaching temporary saturation limits. Performance stabilized and systems returned to normal operating levels automatically as the traffic burst abated.

Root Cause

Primary Root Cause: A performance scaling threshold was reached on the Evaluation PATCH endpoint due to high latencies on an underlying dataset. As the dataset grew organically, a specific query pattern that was previously efficient became computationally expensive. When met with a concentrated peak in concurrent request volume, database CPU reached saturation, leading to elevated API latency and timeouts.

Contributing Factors:

  • Dataset Growth: As the dataset grew over time, the existing query pattern became more expensive to execute, reducing the system’s performance headroom during traffic spikes.
  • Concurrency Intensity: A sustained surge in evaluation request volume surfaced an efficiency bottleneck that was not observable at standard traffic levels.
  • Retry Amplification: Standard automated retry behavior during the initial slowdown increased the total request volume, further compounding the pressure on database resources.
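
The retry amplification described above is commonly damped on the client side with capped exponential backoff plus jitter. A minimal sketch follows; the function name and parameter values are illustrative, not Socure's actual retry configuration:

```python
import random

def backoff_delays(max_retries=4, base=0.1, cap=2.0):
    """Yield one delay per retry attempt: capped exponential backoff with full jitter.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)], so
    clients spread their retries out instead of re-submitting in lockstep,
    which avoids multiplying load on an already saturated datastore.
    """
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))

for delay in backoff_delays():
    print(f"sleeping {delay:.3f}s before next retry")
```

Because each client draws an independent random delay, a burst of simultaneous failures does not translate into a synchronized burst of retries.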

Non-contributing Factors:

  • Infrastructure Availability: There was no evidence of underlying infrastructure failure, hardware downtime, or regional cloud outages.
  • Rate-Limit Configuration: The incident was not caused by a breach of configured rate limits; traffic remained within the platform’s aggregate capacity boundaries.

Resolution

The incident followed an automated recovery path. As the concurrency burst subsided, the database successfully cleared the request backlog, and API performance returned to standard operating levels without manual intervention.

Following the stabilization, Socure Engineering conducted a performance audit and implemented the following system hardening measures to prevent a recurrence:

  • Query Pattern Optimization: Deployed a targeted indexing strategy to align the database execution plan with current data volumes. This achieved a 99% reduction in latency for the query pattern that contributed to the incident.
  • Storage Efficiency Maintenance: Executed a comprehensive cleanup of the high-traffic datasets to minimize I/O overhead and maximize available CPU headroom for future traffic bursts.
  • Performance Validation: Confirmed via a simulated sustained surge that the Evaluation APIs perform consistently within standard operating thresholds and that the specific data access bottleneck has been resolved.
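
The effect of a targeted index on a query's execution plan can be illustrated with a before/after comparison. This sketch uses an in-memory SQLite database as a stand-in for the production datastore; the table, column, and index names are invented for illustration:

```python
import sqlite3

# In-memory stand-in for a high-traffic metadata table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluation_metadata (evaluation_id TEXT, updated_at TEXT, payload TEXT)"
)

query = "SELECT payload FROM evaluation_metadata WHERE evaluation_id = ?"

# Without an index, the planner must scan the whole table for every lookup,
# so query cost grows linearly with dataset size.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, ("abc",)).fetchone()[3]

# A targeted index lets the planner seek directly to the matching rows.
conn.execute("CREATE INDEX idx_eval_id ON evaluation_metadata (evaluation_id)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, ("abc",)).fetchone()[3]

print("before:", plan_before)  # a full-table SCAN
print("after: ", plan_after)   # a SEARCH using the new index
```

This is the general mechanism behind the reported latency reduction: a scan whose cost grows with the table is replaced by an index seek whose cost stays roughly flat as the dataset grows.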

Corrective and Preventive Actions

  • Database Capacity Guardrails: Strengthen database monitoring to detect "cost-per-query" trends earlier, before they impact API latency. (Completed on March 11th)
  • Load Distribution Enhancements: Conduct a full efficiency audit of all high-growth datasets to ensure data access remains optimized. (Completed on March 11th)
  • Enhanced Query Controls: Refine query timeout settings to prevent unnecessary request amplification during short-lived performance slowdowns. (In-Progress / Target ETA: March 20th)
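
A query timeout of the kind described under Enhanced Query Controls can be sketched as a per-statement deadline. This sketch uses SQLite's progress handler as a stand-in for a production statement timeout (such as a server-side setting); the function name, deadline, and test query are illustrative:

```python
import sqlite3
import time

def run_with_deadline(conn, sql, deadline_s):
    """Run a query but abort it if it exceeds deadline_s seconds.

    SQLite invokes the progress handler periodically during execution;
    returning a truthy value interrupts the running statement, which
    surfaces as sqlite3.OperationalError. Capping query runtime keeps a
    slow query from holding resources while callers time out and retry.
    """
    start = time.monotonic()
    conn.set_progress_handler(lambda: time.monotonic() - start > deadline_s, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)  # clear the handler

conn = sqlite3.connect(":memory:")
# A deliberately expensive cross join stands in for a slow query pattern.
slow_sql = """
WITH RECURSIVE n(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM n LIMIT 3000)
SELECT count(*) FROM n a, n b, n c
"""
try:
    run_with_deadline(conn, slow_sql, deadline_s=0.05)
    print("completed within deadline")
except sqlite3.OperationalError:
    print("query aborted at deadline")
```

Bounding each statement's runtime this way limits how long a degraded query pattern can occupy database CPU, which in turn limits the retry amplification described in the root cause.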

Next Steps & Ongoing Commitment

Socure takes accountability for the Evaluation API degradation on March 10. Maintaining consistent, reliable API performance as our customers scale is our top priority.

We have successfully optimized the identified data access paths and are strengthening our monitoring and query controls to further improve system resilience. The Evaluation APIs are performing within normal thresholds, and we remain committed to ensuring the platform is equipped for continued growth.

Posted Mar 13, 2026 - 17:21 EDT

Resolved

On March 10th, 2026 between 4:53 PM - 5:03 PM (ET), the RiskOS platform experienced a temporary spike in HTTP 4xx (including 409 conflicts) and 5xx errors across RiskOS APIs. The errors were driven by elevated latency, which resulted in request timeouts and retries.

The issue has now been fully resolved. Our team is conducting a root cause analysis and will share a detailed update as soon as it is available. We apologize for any inconvenience this may have caused and appreciate your patience.
Posted Mar 10, 2026 - 18:00 EDT
This incident affected: RiskOS.