Session: Beyond Manual Tuning: How RL Agents Learned to Optimize Apache Spark
Apache Spark's performance depends heavily on configuration parameters like shuffle partitions, memory allocation, and parallelism settings. Data engineering teams typically rely on static defaults (which rarely match workload reality) or time-consuming manual tuning that doesn't adapt as data patterns evolve. A configuration optimized for small daily reports fails catastrophically on massive end-of-month aggregations—yet tuning each workload variant manually is unsustainable as organizations process increasingly diverse datasets.
This talk demonstrates how reinforcement learning transforms Spark configuration from a manual bottleneck into an autonomous, adaptive system. We built a Q-learning agent that observes dataset characteristics (size, cardinality, skew), experiments with different configurations, measures performance, and learns optimal settings for varying workload patterns—developing expertise comparable to experienced engineers but with perfect memory and systematic exploration.
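The observe, experiment, measure, learn loop described above can be illustrated with a minimal tabular Q-learning sketch. This is not the speaker's implementation: the action set, the state buckets, and the `simulated_runtime` cost model are hypothetical stand-ins for real job runs, and each job is treated as a one-step episode, so the update target is simply the observed reward (negative runtime).

```python
import random
from collections import defaultdict

# Hypothetical action space: candidate values for spark.sql.shuffle.partitions.
ACTIONS = [50, 200, 800]

# Hypothetical workloads as (dataset size in GB, skew score in [0, 1]).
WORKLOADS = [(2, 0.1), (500, 0.8)]


def bucket_state(size_gb, skew):
    """Discretize dataset characteristics into a small state label."""
    size = "small" if size_gb < 10 else "large"
    return (size, "skewed" if skew > 0.5 else "uniform")


def simulated_runtime(size_gb, partitions):
    """Toy cost model (an assumption, not measured Spark behavior):
    per-task work shrinks with more partitions, scheduling overhead grows."""
    return 100 * size_gb / partitions + 0.05 * partitions


class QAgent:
    def __init__(self, alpha=0.2, epsilon=0.2, seed=0):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.alpha = alpha
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def best(self, state):
        """Greedy action for a state under the current estimates."""
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def choose(self, state):
        """Epsilon-greedy: explore occasionally, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return self.best(state)

    def update(self, state, action, reward):
        """One-step episode, so the target is the reward itself."""
        key = (state, action)
        self.q[key] += self.alpha * (reward - self.q[key])


def train(episodes=1000, seed=0):
    agent = QAgent(seed=seed)
    rng = random.Random(seed + 1)
    for _ in range(episodes):
        size_gb, skew = rng.choice(WORKLOADS)
        state = bucket_state(size_gb, skew)
        action = agent.choose(state)
        reward = -simulated_runtime(size_gb, action)  # faster job, higher reward
        agent.update(state, action, reward)
    return agent


agent = train()
for size_gb, skew in WORKLOADS:
    state = bucket_state(size_gb, skew)
    print(state, "->", agent.best(state))
```

In a real deployment the reward would come from measured job metrics rather than a formula, and the state would likely include more features (cardinality, for instance) than the two buckets used here.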
Through comparative experiments, we show that combining our RL agent with Spark's Adaptive Query Execution (AQE) delivers 46-68% performance improvements over AQE alone. The RL agent provides pre-execution intelligence by selecting optimal initial configurations, while AQE handles runtime adaptations—addressing complementary optimization opportunities.
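As a rough illustration of this division of labor, the sketch below builds a configuration in which an agent-chosen starting point is set alongside AQE's runtime features. The helper names and the chosen values are hypothetical; the configuration keys are standard Spark properties, and `--conf key=value` is the usual `spark-submit` form.

```python
def build_spark_conf(shuffle_partitions, executor_memory_gb):
    """Combine pre-execution choices with AQE's runtime adaptation."""
    return {
        # Pre-execution choices (hypothetically produced by the RL agent).
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        "spark.executor.memory": f"{executor_memory_gb}g",
        # Runtime adaptation left to Adaptive Query Execution.
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
    }


def to_submit_args(conf):
    """Render the settings as spark-submit style arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Example values only; in practice they would come from the agent's policy.
conf = build_spark_conf(shuffle_partitions=800, executor_memory_gb=8)
submit_args = to_submit_args(conf)
print(" ".join(submit_args))
```

The same dictionary could instead be applied through `SparkSession.builder.config(key, value)` when the session is created inside the application.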
We then extend this to Multi-Agent Reinforcement Learning (MARL), where specialized agents independently optimize different domains: partitions, memory allocation, CPU cores, and caching strategies. Each agent becomes an expert in its area while collectively achieving comprehensive workload optimization, demonstrating a practical path toward intelligent, self-tuning big data infrastructure.
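One simple way to picture the multi-agent setup is as a team of independent learners, each owning one configuration domain and all receiving the same reward from the job they jointly configured. The sketch below makes several simplifying assumptions that the talk does not confirm: the agents are stateless epsilon-greedy learners, the domain values are illustrative, and `toy_runtime` is an invented additive cost model rather than a measured runtime.

```python
import random

# Hypothetical per-domain action sets; the values are illustrative only.
DOMAINS = {
    "partitions": [50, 200, 800],
    "executor_memory_gb": [4, 8, 16],
    "executor_cores": [2, 4],
    "cache_inputs": [False, True],
}


def toy_runtime(cfg):
    """Invented additive cost model standing in for a measured job runtime."""
    cost = abs(cfg["partitions"] - 200) / 10
    cost += abs(cfg["executor_memory_gb"] - 8) * 2
    cost += 10 / cfg["executor_cores"]
    cost += 0 if cfg["cache_inputs"] else 5
    return cost


class DomainAgent:
    """Independent epsilon-greedy learner that owns one configuration domain."""

    def __init__(self, actions, epsilon, rng):
        self.actions = actions
        self.epsilon = epsilon
        self.rng = rng
        self.value = {a: 0.0 for a in actions}  # running mean reward per action
        self.count = {a: 0 for a in actions}

    def greedy(self):
        return max(self.actions, key=lambda a: self.value[a])

    def act(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.greedy()

    def learn(self, action, reward):
        """Incremental sample-average update of the chosen action's value."""
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]


def train_team(episodes=3000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    team = {name: DomainAgent(actions, epsilon, rng)
            for name, actions in DOMAINS.items()}
    for _ in range(episodes):
        # Each agent picks a value for its own domain only.
        joint = {name: agent.act() for name, agent in team.items()}
        reward = -toy_runtime(joint)  # one shared reward for the whole team
        for name, agent in team.items():
            agent.learn(joint[name], reward)
    return {name: agent.greedy() for name, agent in team.items()}


joint_config = train_team()
print(joint_config, toy_runtime(joint_config))
```

Because every agent sees only the shared reward, each one's estimates are affected by the others' exploration. With a cost model that is additive across domains, as here, that interference tends to average out; with strongly interacting parameters, some form of coordination between agents would likely be needed.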
Key Takeaways:
1. How Q-learning agents learn from job execution feedback to build configuration policies
2. Why hybrid RL+AQE optimization outperforms either approach alone
3. Multi-agent architecture for scaling autonomous optimization across all Spark parameters
4. Practical deployment strategies for production environments
Bio
Hina is a technical leader with extensive experience in designing and developing scalable, high-performance applications. She holds a Master’s degree in Information Systems and a Bachelor’s degree in Computer Science Engineering. Over the years, she has demonstrated her technical expertise through impactful roles at Cisco Systems, VMware, and CloudHealth Technologies, excelling in areas such as cloud-based microservices, big data platforms, and SaaS solution development. In recognition of her leadership and technical impact, Hina was named the 2025 Women in Tech – Software Engineering Leader of the Year. Beyond her industry contributions, she is passionate about giving back to the community by mentoring students and delivering guest lectures at universities, inspiring the next generation of technology professionals.