Root Cause Analysis:
Incident: Elevated 5xx errors in Device Ingestion Services
Date Range: March 20, 2026, 2:45 PM ET to 3:43 PM ET
Impact: All Device and Digital Intelligence SDK traffic was affected.
Status: Resolved
On March 20, 2026, between 2:45 PM ET and 3:43 PM ET, Socure's Device Ingestion service was unavailable for approximately 58 minutes. The incident was triggered by a sudden, significant surge in end-user app activity that exposed a gap in our traffic-handling capacity and caused the ingestion service to become unavailable. We attempted to scale up the service, but the sustained load prevented new instances from remaining healthy, so we temporarily paused ingress traffic to allow the service to stabilize. During this downtime, the service stopped returning session tokens, and as a result the SDK did not return tokens to the customer.
| Time (ET) | Event |
|---|---|
| 2:45 PM | Unexpected traffic surge begins; device ingestion service becomes unavailable |
| 2:50 PM | Team attempts to scale up the service; new instances are unable to remain healthy under sustained load |
| ~3:00 PM | Incident call initiated; on-call team begins investigation |
| ~3:10 PM | Team identifies the traffic surge through ingress log analysis |
| 3:21 PM | Inbound traffic temporarily paused to allow service recovery |
| 3:39 PM | Inbound traffic restored |
| 3:43 PM | Service stabilizes and status page is updated to resolved |
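The durations implied by the timeline can be checked with a short calculation. The sketch below only uses the timestamps listed in the table; the function and variable names are illustrative and are not part of any Socure tooling.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the timeline above (Eastern Time, March 20, 2026).
surge_begins = datetime(2026, 3, 20, 14, 45)
traffic_paused = datetime(2026, 3, 20, 15, 21)
traffic_restored = datetime(2026, 3, 20, 15, 39)
resolved = datetime(2026, 3, 20, 15, 43)

total_outage = minutes_between(surge_begins, resolved)
time_to_pause = minutes_between(surge_begins, traffic_paused)
pause_length = minutes_between(traffic_paused, traffic_restored)

print(f"Total unavailability: {total_outage} minutes")
print(f"Surge start to traffic pause: {time_to_pause} minutes")
print(f"Traffic pause duration: {pause_length} minutes")
```

This gives 58 minutes of total unavailability, of which inbound traffic was deliberately paused for 18 minutes.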
The device ingestion service is designed to accommodate typical traffic spikes, but the volume observed in this case significantly exceeded anticipated levels.
The device ingestion service was unable to absorb this sudden load. New service instances were overwhelmed as soon as they began receiving traffic, before they could fully initialize, creating a cycle that outpaced the service's automatic scaling mechanism. This resulted in elevated error rates across Device and Digital Intelligence services and prevented customers from generating new device session tokens during the outage window.
This incident was caused by a legitimate traffic surge from end-user activity. No software deployments or configuration changes were associated with the incident.
The team temporarily paused inbound traffic to the device ingestion service so that a scaled-up set of service instances, provisioned with increased resources, could initialize successfully. Traffic was restored at 3:39 PM ET, and the service stabilized within minutes. The incident was declared resolved at 3:43 PM ET. The elevated resource settings were retained through the following weekend to accommodate continued elevated traffic.
Socure takes full accountability for this incident and the impact it had on our customers. We are implementing rate limiting and per-customer traffic controls at the device ingestion layer so that a similar surge cannot affect the availability of the shared service, and we remain committed to delivering the reliability our customers depend on.
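For readers unfamiliar with per-customer traffic controls, the following is a minimal sketch of one common approach, a token bucket kept separately for each customer. It is an illustration only and does not describe Socure's actual implementation: the class names, the `customer_id` key, and the rate and capacity values are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    rate: float        # tokens replenished per second
    capacity: float    # maximum burst size
    tokens: float      # tokens currently available
    updated_at: float  # clock reading at the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerCustomerLimiter:
    """Hypothetical limiter: one independent bucket per customer."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, customer_id: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(customer_id)
        if bucket is None:
            # A new customer starts with a full bucket.
            bucket = TokenBucket(self._rate, self._capacity,
                                 self._capacity, now)
            self._buckets[customer_id] = bucket
        return bucket.allow(now)


# Usage with a fake clock so the behavior is deterministic.
fake_time = [0.0]
limiter = PerCustomerLimiter(rate=1.0, capacity=2.0,
                             clock=lambda: fake_time[0])
burst = [limiter.allow("customer-a") for _ in range(3)]
other = limiter.allow("customer-b")
```

In this sketch, a burst from `customer-a` exhausts only that customer's bucket, so requests from `customer-b` are still admitted. That isolation is the property per-customer controls are meant to provide for a shared service.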