Session: The Critical First Minutes: When Small Failures Become Major Incidents
One of your services starts returning errors at a low rate. Shortly after, multiple systems are down and your on-call team is scrambling to understand what happened.
Drawing on real incident post-mortems, this talk examines how failures compound in the first minutes of an incident. We'll trace common patterns: retry logic that amplifies problems, health checks that miss actual service degradation, and timeouts that cascade across dependencies. You'll learn to recognize when a small issue is about to become a major incident.
You'll walk away with practical decision frameworks for those first minutes: when to restart services versus letting them stabilize, how to detect degradation before it spreads, and which automated responses actually help during failures.
Because what you do in those critical first minutes often determines whether you have a brief incident or a long outage.
Bio
Aishvaryaa is a Senior Software Engineer at Apple, where she builds network traffic components for Apple Services. Previously at AWS, she engineered the NAT Gateway control plane, building reliability directly into its customer-facing APIs. Her work extended to highly distributed systems that handle network traffic for Lambda, NLB, and PrivateLink and serve millions of users globally. With 8+ years in the networking domain, she has responded to countless production incidents involving cascading failures and runaway client behavior. She specializes in implementing practical safeguards strengthened by metrics and observability, enabling teams to deploy confidently in highly distributed, high-traffic environments.