Session: The Critical First Minutes: When Small Failures Become Major Incidents
One of your services starts returning errors at a low rate. Shortly after, multiple systems are down and your on-call team is scrambling to understand what happened.
Drawing on real incident post-mortems, this talk examines how failures compound in the first minutes of an incident. We'll trace common patterns: retry logic that amplifies problems, health checks that miss actual service degradation, and timeouts that cascade across dependencies. You'll learn to recognize when a small issue is about to become a major incident.
You'll walk away with practical decision frameworks for those first minutes: when to restart services versus letting them stabilize, how to detect degradation before it spreads, and which automated responses actually help during failures.
What you do in those first minutes often determines whether you have a brief incident or a long outage.
Bio
Gnanaguruparan Aishvaryaadevi is a Senior Software Engineer specializing in hyperscale network infrastructure architecture. She has spent 8+ years engineering distributed systems that manage billions of concurrent connections for globally deployed services, building solutions for ultra-low-latency packet processing, stateful traffic migration at planetary scale, and zero-downtime deployments where microseconds matter and failures impact millions of users. Her work spans control plane architecture, userspace networking optimization, and reliability engineering for mission-critical infrastructure.