Root Cause Analysis
Date: March 10, 2026
On March 10, 2026, between 4:53 PM and 5:03 PM ET, RiskOS experienced a brief service degradation impacting the Evaluation APIs.
During a period of concentrated peak request volume, a high-traffic database table reached a performance scaling threshold. Over the past six months, organic growth in the metadata table increased the computational effort required to process certain requests. This growth, combined with a sudden burst in concurrency, led to temporary database CPU saturation.
The incident resolved through an automated recovery path: once peak concurrency normalized, the system cleared the request backlog without manual intervention. Following recovery, Socure Engineering hardened the data access layer, reducing latency for the contributing query pattern by 99% so the platform can accommodate future growth without recurrence.
| Time (ET) | Event |
|---|---|
| 04:53 PM | Elevated latency and degraded availability began affecting the RiskOS Evaluation APIs. |
| 04:54 PM | Automated monitoring triggered alerts for increased 4xx/5xx response codes. The on-call engineering team was paged and began investigating. |
| 04:55 PM | A second alert for increased API latency was triggered. |
| 04:59 PM | A database monitoring alert was triggered, indicating a sudden spike in CPU utilization on the datastore. |
| 05:00 PM – 05:03 PM | The CPU alert helped the team isolate the problematic query pattern within the Evaluation PATCH endpoint. As the burst of requests causing the degradation subsided, API latency and error rates started to return to normal. |
| 05:03 PM | System performance fully stabilized. The Evaluation APIs resumed normal operation without further elevated error rates or latency. The incident was closed, and a post-incident review was initiated. |
During the ~10-minute impact window, approximately two-thirds of evaluation requests completed successfully. The remainder returned 4xx/5xx errors or timed out due to temporary database CPU saturation. Performance stabilized automatically as the traffic burst abated, and systems returned to normal operating levels.
Primary Root Cause: A performance scaling threshold was reached for the Evaluation PATCH endpoint due to high query latencies on an underlying dataset. As the dataset grew organically, a query pattern that was previously efficient became computationally expensive. Combined with a concentrated peak in concurrent request volume, this drove database CPU to saturation, producing elevated API latency and timeouts.
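The failure mode above can be reproduced in miniature with sqlite3: a lookup that is cheap on a small table silently degrades into a full scan as the table grows, until an index on the hot column restores efficient access. The schema, table name, and index name below are illustrative assumptions, not Socure's actual data model.

```python
import sqlite3

# Hypothetical sketch of the root cause: organic table growth turning a
# once-cheap lookup into a full-table scan. Names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluation_metadata "
    "(id INTEGER PRIMARY KEY, evaluation_id TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO evaluation_metadata (evaluation_id, updated_at) VALUES (?, ?)",
    [(f"eval-{i % 1000}", "2026-03-10") for i in range(100_000)],
)

query = "SELECT * FROM evaluation_metadata WHERE evaluation_id = ?"

# Without an index, every PATCH-style lookup must scan all rows, so
# per-request cost grows linearly with table size.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, ("eval-42",)).fetchone()[3]
print(plan_before)  # a SCAN over the table

# The remediation pattern: an index on the hot lookup column turns the
# scan into a seek, decoupling query cost from table growth.
conn.execute("CREATE INDEX idx_eval_id ON evaluation_metadata (evaluation_id)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, ("eval-42",)).fetchone()[3]
print(plan_after)  # a SEARCH using the new index
```

The same shape of fix (adding or restructuring an index so cost per query stays flat as data grows) is one plausible form the "99% latency reduction" optimization could take; the report does not state the exact change.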
Contributing Factors:
- Six months of organic growth in the metadata table, which gradually increased the computational cost of the affected query pattern.
- A concentrated burst of concurrent requests to the Evaluation PATCH endpoint, which pushed database CPU to saturation.

Recovery:
The incident followed an automated recovery path. As the concurrency burst subsided, the database successfully cleared the request backlog, and API performance returned to standard operating levels without manual intervention.
Following the stabilization, Socure Engineering conducted a performance audit and implemented the following system hardening measures to prevent a recurrence:
| Action | Description | Status / ETA |
|---|---|---|
| Database Capacity Guardrails | Strengthen database monitoring to detect "cost-per-query" trends earlier, before they impact API latency. | Completed on March 11th |
| Load Distribution Enhancements | Conduct a full efficiency audit of all high-growth datasets to ensure data access remains optimized. | Completed on March 11th |
| Enhanced Query Controls | Refine query timeout settings to prevent unnecessary request amplification during short-lived performance slowdowns. | In-Progress / Target ETA: March 20th |
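The "request amplification" risk that the Enhanced Query Controls work addresses can be shown with simple arithmetic: when clients time out and retry while the database is slow, each logical request fans out into several in-flight queries, multiplying load exactly when capacity is scarcest. The function and numbers below are a hypothetical sketch, not Socure's actual timeout or retry configuration.

```python
# Illustrative arithmetic behind retry-driven request amplification during a
# short-lived slowdown. All latency/timeout values are hypothetical.

def amplification_factor(db_latency_ms: float, client_timeout_ms: float,
                         max_retries: int) -> int:
    """Queries issued per logical request when each attempt may time out."""
    if db_latency_ms <= client_timeout_ms:
        return 1  # the attempt completes; no retry fires
    return 1 + max_retries  # every attempt times out, so all retries fire

# Normal operation: 20 ms queries against a 500 ms timeout -> no amplification.
print(amplification_factor(20, 500, max_retries=3))    # 1

# During saturation: 2000 ms queries -> each logical request costs 4 queries,
# deepening the very backlog that caused the slowdown.
print(amplification_factor(2000, 500, max_retries=3))  # 4
```

Tuning timeouts so they trip only on genuinely stuck queries, and bounding retries, keeps this factor near 1 during brief slowdowns, which is the stated intent of refining the query timeout settings.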
Socure takes accountability for the Evaluation API degradation on March 10. Maintaining consistent, reliable API performance as our customers scale is our top priority.
We have successfully optimized the identified data access paths and are strengthening our monitoring and query controls to further improve system resilience. The Evaluation APIs are performing within normal thresholds, and we remain committed to ensuring the platform is equipped for continued growth.