Session: Evaluating the Unpredictable: Observability for Production-grade LLM Agents
As LLM-powered applications transition from demos to production, teams encounter failure modes that traditional observability tools were never designed to handle: non-deterministic reasoning, tool misuse, silent hallucinations, evaluation blind spots, and unpredictable cost explosions.
This talk presents a production-first evaluation and observability framework for LLM agents, grounded in real-world production experience. We show how to define agent-specific evaluation criteria and wire them directly into tracing and feedback loops, turning evaluation from a one-time exercise into a continuous system.
Attendees will learn how to:
- Define evaluation criteria beyond accuracy, including tool correctness, reasoning validity, outcome relevance, latency, and cost efficiency, and decide when to apply each (see the sketch after this list)
- Detect hidden failure modes such as silent hallucinations, incorrect tool selection, partial task completion, and cascading agent errors
- Combine automated evals with human-in-the-loop signals to validate edge cases and continuously recalibrate scoring thresholds
- Use tracing and structured telemetry to correlate eval failures with specific prompts, tools, or reasoning steps
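
The bullets above map onto a small amount of wiring code. Below is a minimal, self-contained sketch (Python standard library only) of how per-step eval results might be attached to structured trace events; `AgentStep`, `evaluate_step`, `emit_trace_event`, and the latency/cost thresholds are all hypothetical illustrations, not any particular framework's API.

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

# Hypothetical record of one agent step: the prompt, the tool the agent
# actually called vs. the one we expected, and its latency/cost footprint.
@dataclass
class AgentStep:
    prompt: str
    tool_called: str
    expected_tool: str
    latency_ms: float
    cost_usd: float

@dataclass
class EvalResult:
    criterion: str
    passed: bool
    detail: str

def evaluate_step(step: AgentStep) -> list[EvalResult]:
    """Score one step against criteria beyond accuracy: tool correctness,
    latency, and cost efficiency. The thresholds are placeholders that a
    real system would recalibrate from human-in-the-loop feedback."""
    return [
        EvalResult("tool_correctness", step.tool_called == step.expected_tool,
                   f"called {step.tool_called!r}, expected {step.expected_tool!r}"),
        EvalResult("latency", step.latency_ms <= 2_000,
                   f"{step.latency_ms:.0f} ms against a 2000 ms budget"),
        EvalResult("cost_efficiency", step.cost_usd <= 0.05,
                   f"${step.cost_usd:.4f} against a $0.05 budget"),
    ]

def emit_trace_event(trace_id: str, step: AgentStep) -> None:
    """Emit one structured telemetry event (JSON lines) so that eval
    failures can later be correlated with the exact prompt and tool call."""
    event = {
        "trace_id": trace_id,
        "timestamp": time.time(),
        "step": asdict(step),
        "evals": [asdict(r) for r in evaluate_step(step)],
    }
    print(json.dumps(event))

if __name__ == "__main__":
    emit_trace_event(
        trace_id=str(uuid.uuid4()),
        step=AgentStep(
            prompt="Find the user's latest invoice",
            tool_called="search_web",        # agent picked the wrong tool
            expected_tool="query_billing_db",
            latency_ms=3400.0,
            cost_usd=0.012,
        ),
    )
```

In practice these JSON-lines events would be shipped to a tracing backend rather than printed, so a failed `tool_correctness` eval can be joined back to the prompt and tool call that produced it, and thresholds can be recalibrated as human feedback accumulates.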
Additionally, we will walk through a real-world LLM agent use case, illustrating how evaluation signals interact with end-to-end traces to surface issues that would otherwise evade detection until user trust is lost.
By the end of the talk, attendees will have a clear blueprint for operationalizing evaluation in LLM agents, enabling teams to ship systems that are not only impressive in demos but also reliable in production.
Bios
Vrinda Bhatia is a seasoned software engineer and AI builder with over a decade of experience at companies including AWS and Block. She is currently a Senior Software Engineer at Block, developing infrastructure for ML inference, where her work helped prevent over $220M in fraud losses in 2024. Before that, she worked on AWS AppStream, a secure application streaming service. Her work there was critical during the COVID-19 pandemic, helping organizations such as the Washington State Pandemic Center and Los Angeles County transition thousands of students and employees to secure, remote environments. She is also a key contributor to DistillFlow (https://github.com/horus-ai-labs/DistillFlow/), an open-source library for model distillation that has earned more than 150 stars on GitHub. She is passionate about solving real-world problems at scale, and beyond her technical work she loves sharing knowledge with the developer community through talks, mentorship, and open collaboration.
Sujata Sridharan is a Senior Software Engineer at Bolt Financial, where she builds AI-driven commerce infrastructure that powers next-generation e-commerce experiences. With nearly a decade of experience spanning Microsoft, Amazon, and Bolt Financial, she specializes in architecting reliable, compliant, and human-centered AI systems, from large-scale identity and security platforms to production-grade LLM infrastructure powering over a billion dollars in transactions. Beyond her engineering work, Sujata is an active mentor and community builder, guiding emerging AI practitioners through workshops, hackathons, and speaking engagements such as DevFest DC. Her current focus is on developing practical frameworks that make trustworthy AI both measurable and scalable across organizations.