Session: Infra & Capacity Planning to Support ML Growth
As machine learning systems move from research prototypes to mission-critical products, scalable and resilient infrastructure becomes essential, not an afterthought. In this talk, we’ll demystify infrastructure and capacity planning for ML at scale and equip you with practical frameworks that support both current workloads and sustainable growth.
You’ll learn how to:
Align ML needs with infrastructure strategy — understand the unique demands of training vs. inference, batch vs. real-time workflows, and cross-functional dependencies.
Plan capacity with clarity — learn techniques for forecasting compute, storage, and network needs, recognizing bottlenecks early, and choosing the right blend of cloud, on-prem, and hybrid architectures.
Optimize costs without compromising performance — apply workload characterization, autoscaling policies, and resource tiers to balance agility and budget.
Build observability and resilience — implement monitoring, alerting, and feedback loops that connect infrastructure health with business outcomes.
Future-proof your platform — establish patterns that adapt to new models, evolving data volumes, and organizational growth.
Through real scenarios and actionable guidance, this session bridges the gap between ML innovation and reliable delivery. Whether you’re an engineer, platform lead, or tech manager, you’ll walk away with practical ways to strengthen your infrastructure foundations as your ML workloads grow.
Bio
Sandy Mohanakumar is a Capacity Planning Lead at OpenAI, where she drives strategic capacity forecasting and infrastructure planning to support large-scale machine learning growth. With 9+ years of experience in ML/AI infrastructure strategy, program leadership, and cross-functional execution — including prior work at Google — Sandy specializes in aligning compute, storage, and performance needs with evolving product and research demands. She is passionate about translating emerging ML innovation into resilient, scalable systems that empower teams and accelerate impact.