[Sev1] Elevated 5xx errors in Device Ingestion Services

Incident Report for Socure

Postmortem

Root Cause Analysis:
Incident: Elevated 5xx errors in Device Ingestion Services
Date Range: March 20, 2026 between 2:45 PM ET and 3:43 PM ET
Impact: All Device and Digital Intelligence SDK traffic was affected.
Status: Resolved

Summary

On March 20, 2026, between 2:45 PM ET and 3:43 PM ET, Socure's Device Ingestion service was unavailable for approximately 58 minutes. The incident was triggered by a sudden, significant surge in end-user app activity that exposed a gap in our traffic-handling capacity. We attempted to scale up the service, but the sustained load prevented new instances from remaining healthy, so we temporarily paused ingress traffic to allow the service to stabilize. During this downtime, the service stopped returning session tokens, and as a result the SDK did not return tokens to customers.

Timeline

Date / Time (ET)	Event
2:45 PM ET	Unexpected traffic surge begins; device ingestion service becomes unavailable
2:50 PM ET	Team attempts to scale up the service; new instances are unable to remain healthy under sustained load
~3:00 PM ET	Incident call initiated, on-call team begins investigation
~3:10 PM ET	Team identifies the surge in traffic using ingress log analysis
3:21 PM ET	Inbound traffic temporarily paused to allow service recovery
3:39 PM ET	Inbound traffic restored
3:43 PM ET	Service stabilizes and status page updated to resolved

Description of Incident

The device ingestion service is designed to accommodate typical traffic spikes, but the volume observed in this case significantly exceeded anticipated levels.

The device ingestion service was unable to absorb this sudden load. New service instances were overwhelmed as soon as they began receiving traffic, before they could fully initialize, creating a failure cycle that the service's automatic scaling mechanism could not outpace. This resulted in elevated error rates across Device and Digital Intelligence services and prevented customers from generating new device session tokens during the outage window.

This incident was caused by a legitimate traffic surge from end-user activity. No software deployments or configuration changes were associated with the incident.

Resolution

The team temporarily paused inbound traffic to the device ingestion service, which allowed a scaled-up set of service instances, provisioned with increased resources, to initialize successfully. Traffic was restored at 3:39 PM ET and the service stabilized within minutes. The incident was declared resolved at 3:43 PM ET. The elevated resource settings were retained through the following weekend to accommodate continued elevated traffic.

Mitigation Plan

Improve traffic forecasting and capacity planning in collaboration with customers, so that anticipated high-traffic periods are accounted for in advance and the service is scaled accordingly.
Implement service-level load shedding so that the ingestion service continues to handle traffic gracefully under high load while additional capacity is being provisioned.
Implement per-customer rate limiting so that a traffic surge from one or more customers does not affect the availability of the shared service.
Reduce service instance startup and readiness times so that additional capacity can be brought online more quickly during scaling events.
Improve alerting and monitoring at the service ingress layer to notify the team earlier when traffic surges are detected.
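
As an illustration of how the load shedding and per-customer rate limiting items above could fit together, the following is a minimal Python sketch and not a description of Socure's actual implementation. All names (TokenBucket, AdmissionController, max_in_flight, and so on), the limits, and the choice of HTTP 503 for shed requests and 429 for rate-limited requests are assumptions made for the example. It combines a cap on in-flight requests per instance with a token bucket per customer, and it is single-threaded for brevity.

```python
import time


class TokenBucket:
    """Permits `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = now

    def try_take(self, now):
        # Refill in proportion to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class AdmissionController:
    """Hypothetical admission check run before a request is processed.

    Not thread-safe; a real service would guard the shared state with a lock.
    """

    def __init__(self, max_in_flight, per_customer_rate, per_customer_burst,
                 clock=time.monotonic):
        self.max_in_flight = max_in_flight
        self.rate = per_customer_rate
        self.burst = per_customer_burst
        self.clock = clock
        self.in_flight = 0
        self.buckets = {}  # customer_id -> TokenBucket

    def try_admit(self, customer_id):
        """Return (admitted, http_status) for one incoming request."""
        now = self.clock()
        # Load shedding: reject early once this instance is saturated,
        # so work already in progress can still complete.
        if self.in_flight >= self.max_in_flight:
            return False, 503
        # Per-customer rate limiting: one customer's surge only uses up
        # that customer's own budget.
        bucket = self.buckets.get(customer_id)
        if bucket is None:
            bucket = self.buckets[customer_id] = TokenBucket(
                self.rate, self.burst, now)
        if not bucket.try_take(now):
            return False, 429
        self.in_flight += 1
        return True, 200

    def release(self):
        """Call when an admitted request finishes."""
        self.in_flight = max(0, self.in_flight - 1)
```

In this sketch a request handler would call try_admit before doing any work and release when an admitted request completes. Rejecting before any work is done is the point of the design: a saturated instance spends almost nothing on requests it cannot serve, which leaves room for new instances to finish starting up.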

Commitment

Socure takes full accountability for this incident and the impact it had on our customers. We are implementing rate limiting and per-customer traffic controls at the device ingestion layer to ensure that another situation like this cannot affect the availability of the shared service. We remain committed to delivering the reliability our customers depend on.

Posted Mar 24, 2026 - 14:25 EDT

Resolved

We have mitigated the issue affecting our Device Ingestion Services and traffic has returned to normal.

Our team is continuing to monitor the situation and will provide an RCA as soon as possible.

Posted Mar 20, 2026 - 15:54 EDT

Investigating

We are experiencing an issue causing elevated 5xx errors impacting Device Ingestion Services. Our team is actively investigating and we will continue to provide updates on this incident.

Posted Mar 20, 2026 - 15:15 EDT

This incident affected: Socure SDKs (SigmaDevice iOS, SigmaDevice Android, Device Risk - WebSDK).