The Invisible Engine: Optimizing AI for the Cloud Without Breaking the Bank by Kaustubha V
Kaustubha V
Solution Architect
Optimizing AI in the Cloud: Cost Efficiency and Sustainability
In today’s rapidly evolving technological landscape, leveraging AI in the cloud has become vital. However, organizations face significant challenges regarding cost and sustainability. This article will explore effective strategies for achieving cost-efficient AI deployments while ensuring responsible AI practices. Let’s dive into the core issues and techniques that can help organizations optimize their AI workloads.
Understanding the Challenges of AI in the Cloud
As organizations increasingly adopt cloud-based AI solutions, they encounter numerous challenges, including:
- Resource Intensity: AI workloads often require substantial computational resources, leading to high operational costs.
- Underutilization of GPUs: Many AI deployments fail to utilize GPU memory effectively, resulting in hidden costs.
- Cloud Sprawl: Unmanaged virtual machines (VMs) and notebooks contribute to soaring expenses.
- Egress Costs: Transferring large AI data inputs between regions adds unexpected costs.
Strategies for Cost-Effective AI Deployments
To tackle these challenges, organizations can implement several key strategies:
1. Optimize Model Fine-Tuning
Using techniques like Low-Rank Adaptation (LoRA) can significantly reduce the cost of fine-tuning large language models (LLMs). Instead of updating all parameters, LoRA trains a small low-rank subspace, so only a fraction of the weights need to be learned and stored.
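As a rough illustration, here is a minimal LoRA sketch using Hugging Face's peft library; the base model (gpt2) and the target module name are assumptions chosen for the example, not a prescription:

```python
# Minimal LoRA fine-tuning sketch with Hugging Face peft.
# Model choice and target_modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the LoRA updates
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

With ranks this small, only the adapter weights are trained and stored, which is where the fine-tuning savings come from.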
2. Implement Quantization
Quantization replaces 32-bit floating-point weights with low-precision integers, reducing the memory footprint of AI models. This can lead to substantial cost savings (a short sketch follows the list below):
- Memory Efficiency: An 8-bit integer (INT8) uses 8 bits per value, while a single-precision float (FP32) uses 32 bits, cutting memory and storage roughly fourfold.
- Framework Compatibility: Frameworks like ONNX Runtime and TensorRT support quantization for enhanced performance.
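As a rough sketch, here is post-training dynamic quantization in PyTorch; the toy model stands in for a real trained network:

```python
# Minimal post-training dynamic quantization sketch in PyTorch.
import torch
import torch.nn as nn

# Placeholder model; a real workload would load trained weights.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Convert Linear layers to 8-bit integer weights for inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same output shape, ~4x smaller Linear weights
```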
3. Utilize Pruning Techniques
Pruning removes unnecessary weights from models, optimizing resource usage. Tools built into PyTorch and TensorFlow facilitate this process (a short sketch follows the list below):
- Cost Reduction: Optimize performance without sacrificing model quality by removing unimportant parameters.
- Toolkit Availability: Leverage frameworks that provide built-in pruning functionalities.
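Here is a minimal magnitude-pruning sketch using PyTorch's built-in torch.nn.utils.prune utilities; the layer size and 30% sparsity level are illustrative choices:

```python
# Minimal L1 magnitude-pruning sketch with torch.nn.utils.prune.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)  # placeholder layer

# Zero out the 30% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by removing the reparameterization.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")  # ~30% of weights are now zero
```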
4. Auto-Scaling with Cloud Native Tools
Scaling resources dynamically based on demand is crucial (a short sketch follows the list below):
- Auto Scaling: Use Azure ML or KEDA (Kubernetes Event-Driven Autoscaling) to adjust resources and manage surges in demand efficiently.
- Monitor Performance: Regularly assess inference demands and user behavior to optimize scaling strategies.
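As one sketch of the min/max-node pattern mentioned in the talk, assuming the v1 azureml-core SDK and a local workspace config; the cluster name, VM size, and node counts are illustrative assumptions:

```python
# Sketch: an Azure ML compute cluster that scales to zero when idle (v1 SDK).
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # assumes a local config.json for the workspace

compute_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC6s_v3",         # GPU SKU, an assumption for the example
    min_nodes=0,                        # scale to zero to avoid idle GPU billing
    max_nodes=4,                        # cap burst capacity
    idle_seconds_before_scaledown=900,  # release nodes after 15 idle minutes
)

cluster = ComputeTarget.create(ws, "gpu-cluster", compute_config)
cluster.wait_for_completion(show_output=True)
```

Setting min_nodes=0 is what prevents the "forgotten VM" pattern from quietly accruing GPU charges.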
Prioritizing Sustainability in AI Practices
Beyond cost, organizations must consider the ecological impact of their AI solutions. The following practices contribute to sustainable AI:
- Energy Efficiency: Opt for cloud providers that prioritize energy-aware deployments, reducing carbon footprints.
- Optimal Scheduling: Execute compute-heavy jobs during off-peak hours to minimize energy costs.
- Carbon Monitoring Tools: Utilize tools like CodeCarbon to track and optimize energy usage (a short sketch follows below).
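As a rough sketch of carbon tracking with the codecarbon package; the project name and the toy workload are placeholder assumptions:

```python
# Minimal sketch: estimating training emissions with codecarbon.
from codecarbon import EmissionsTracker

def train():
    # Stand-in for a real training loop.
    return sum(i * i for i in range(10_000_000))

tracker = EmissionsTracker(project_name="llm-finetune")  # name is illustrative
tracker.start()
try:
    train()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kg CO2-equivalent
    print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```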
Key Takeaways for Effective AI Practices
To wrap up, organizations looking to optimize AI deployments should focus on:
- Utilizing cost-efficient techniques such as LoRA and quantization
- Implementing auto-scaling using cloud-native tools to match demand dynamically
- Considering environmental impact alongside operational costs
- Conducting regular benchmarks to avoid overengineering
Efficiency is invisible until it’s missing. Together, we can make AI not just intelligent but also efficient and ethical.
For further engagement on AI sustainability, feel free to connect with me on LinkedIn or drop me an email. Let’s pave the way for a smarter, greener future!
Video Transcription
And, welcome aboard. So this slide is a small introduction about me. I'm a solution architect at Microsoft, an AI researcher, and a multi-cloud specialist, and I've done a couple of certifications in all of the major clouds. I've also been trying to upskill myself in some of the technologies that are very trending right now and that are changing the world of cloud, like Kubernetes, whether on the Azure side or the Google side. And, as the slide shows, I've been a researcher, and I've published some of my articles at NeurIPS and a couple of other venues. My focus has been on creating scalable, cost-effective deployments of AI that are also cloud optimized.
I've also been passionate about responsible AI and federated learning, and I've written some articles on those topics. So that's a little introduction about me. Okay. Now, if you think about the AI challenges today: everybody has been introduced to AI, and everybody uses the cloud. One of the important challenges we face right now is that we have AI, and we've been using AI in the cloud, but it is very powerful and requires a lot of resources, and having so many resources is very costly. So that's one of the challenges we're facing today. Today's session's goals will be to find out whether there is a way to have fast, efficient, but cost-conscious AI delivery, and to optimize the utilization of AI.
That's the first thing we'll be learning. The other thing is that I'll take you through cost-effective deployments in AI, the right way to use auto-scaling and optimizations, and sustainability in AI. So these are some of the things I'll be covering today. Now, if you look at the high-cost problem of AI workloads, we see that underutilized GPUs are a major hidden cost. Many AI deployments do not make maximum use of the GPU memory that is allotted, yet they still block the full instance. The other thing is egress cost: we transfer large AI inputs between regions or services.
This adds silent costs, with silent billing happening in the background. That's one of the things that is not taken care of, or even thought about, when we start using AI. The other issue is cloud sprawl, where ML engineers spin up VMs and notebooks and forget to shut them down. All of this builds up the cost, and we see cloud costs skyrocket fast. People complain that when they were on-prem they didn't have so much cost, but once they start using AI in the cloud, the costs skyrocket very quickly. So there are methods we need to adopt so that there are no unmanaged VMs, and clear-cut instructions on how we can spin up VMs and shut them down, which brings in the optimization that is essential.
But it can be complicated in certain ways, so there should be a clear-cut line between those. So let us dive into some of the techniques that can be used for optimization. LoRA is one of those techniques. LoRA stands for low-rank adaptation: instead of updating all the weights, we update a small low-rank subspace, which gives massive cost savings for fine-tuning LLMs. I think most of us have heard the term fine-tuning in the context of LLMs. Here, we need to make sure the LLM is learning whatever is necessary. So instead of updating all of the weights for all of the parameters that have been loaded, we update a small low-rank subspace within the whole set, and massive cost savings can be achieved when fine-tuning LLMs.
That's one technique we can utilize. What is quantization? Quantization means using integers wherever we are using floats. I think most of us know from programming languages that there are integers, floats, decimals, and so on. Right? So for model weights, we can use INT8 instead of FP32, because an 8-bit integer uses 8 bits per value while a 32-bit float uses 32 bits. So we shrink the memory footprint there. There are also runtimes that support this, like the frameworks ONNX Runtime and TensorRT.
Utilizing such quantization techniques makes sure the cost is cut. The other thing we can utilize is pruning. Pruning is where we remove unnecessary weights to stop spending on unnecessary resources. We can use the TensorFlow Model Optimization toolkit, and there are utilities available in PyTorch, for example torch.nn.utils.prune, which give a good start in reducing cost while keeping the benefits that are present. We optimize the cost without degrading quality by using such pruning techniques. One Azure tip I can give you: in Azure ML, configure inference clusters with max and min nodes and set the autoscale metric to GPU utilization, so that resource request counts are taken care of. So there are certain tips and tricks we can use to reduce the cost.
Now, when we talk about auto-scaling in AI workloads: why do we need scaling? When a lot of people come into the cloud and use our product, we need to make sure our product scales so that everybody can access it. When we scale these things, we want to make sure monitoring is happening on the inference demand, that hyperparameter tuning is happening for the scaling, and that batch-size adjustments are happening. For example, if we know that only a certain number of people will use a certain piece of functionality in an app, we should make sure an appropriate batch size is set. We should also schedule for traffic peaks when we know it's a holiday season and people will shop more or visit our website more.
That's when we need dynamic scaling based on more than just CPU or other utilization metrics; maybe we can have some scheduled peak-traffic handling. The key tools here include Azure ML autoscaling, which takes care of the fact that auto-scaling is not just horizontal: it can be vertical, adding more resource limits, or it can add more replicas. We can also use KEDA in Kubernetes, which is Kubernetes event-driven autoscaling, to customize AI jobs. It can trigger on queues, inference requests, or message processing.
AI jobs can be triggered by all of these, so we can use such things. Now, we know that we need accuracy, we want speed, and we want low cost, and none of the three can be compromised. When we try to achieve all of them, we want some frameworks to help us decide on the accuracy, speed, and cost trade-offs. Optuna is one such framework, where through hyperparameter-based tuning you can express custom cost/performance trade-offs. And there is something called Hugging Face Accelerate, which speeds up training and reduces memory overhead, reducing the cost without compromising on performance.
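As a rough illustration of the kind of cost/accuracy trade-off search Optuna enables; the evaluate() helper and its numbers below are placeholder assumptions, not real benchmarks:

```python
# Sketch: a simple cost/accuracy trade-off search with Optuna.
import optuna

def evaluate(batch_size: int, precision: str) -> tuple[float, float]:
    # Stand-in for a real benchmark run; these numbers are fabricated placeholders.
    accuracy = 0.90 + (0.02 if precision == "fp32" else 0.0)
    cost = batch_size * (2.0 if precision == "fp32" else 1.0)
    return accuracy, cost

def objective(trial: optuna.Trial) -> float:
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32, 64])
    precision = trial.suggest_categorical("precision", ["fp32", "int8"])
    accuracy, cost = evaluate(batch_size, precision)
    # One simple trade-off formulation: reward accuracy, penalize cost.
    return accuracy - 0.001 * cost

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params)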
That is one of the things we can utilize. We saw a case study at NeurIPS when I visited last year, in 2024, when NeurIPS happened in Vancouver, Canada. There, a team found that a small distilled model, I think it was DistilBERT, had 97% of BERT's accuracy with 40% less cost. We want to achieve something like that, where the accuracy and the speed are high and the cost is low. Finding that kind of sustainable practice is what we are trying to achieve. So now we know what we are trying to achieve and what is necessary to make sure we have sustainable AI.
What are the things we need to keep in mind when thinking about sustainable AI? When we want to make sure our AI is sustainable, we want energy-efficient regions to be available. That means some cloud providers, like Azure, expose carbon intensity data and prefer carbon-aware cloud zones. These are some of the measures different cloud providers are offering. They also make sure training jobs run during low-demand hours, so off-peak scheduling is happening. When something has to run for a long time, they make sure it happens during lower demand, in geographies where people may be sleeping or where there is less load.
They make sure scheduling happens in that part of the world. There is also an approach where workloads run in zones with high renewable energy during those hours. There are tools such as CodeCarbon and ML CO2 Impact, and other tools of that type, which help us reduce the carbon footprint and give us information on how much carbon all of this produces. And beyond the financial cost, there is also an ecological cost: AI can have a carbon footprint larger than some countries if it is not managed. So how can we prevent this?
We need to make sure we choose carbon-aware cloud regions; Azure and GCP offer carbon-tracking dashboards where you can have all the details present. We must also make sure we schedule compute-heavy jobs during off-peak grid hours. And, as I mentioned, there are tools like CodeCarbon to measure and optimize the energy you use, with everything on your site visible in a dashboard. We must make sure that sustainability and optimization go hand in hand. So what are the key takeaways from this AI cloud optimization session? Let us recap the key strategies shared during the session. First, we need to use cost-efficient deployments: we can use LoRA, quantization, pruning, or lean models to make sure we are choosing the right type of cost-efficient deployment.
The other thing is to auto-scale using cloud-native tools, such as KEDA or Azure ML, to match the demand dynamically, using the built-in platform tools. Next is using model optimization techniques: profile deeply before the architecture choice is made, and do profiling and benchmarking before committing to an architecture. We also have to think about environmental impact as part of responsible AI. We have to think about the energy usage, not just the dollars we are making.
We need to think about the energy usage that is happening and the impact it creates on the environment, and about the carbon usage for it. We also have to do regular benchmarking, to avoid over-engineering and wasted cost. And remember, over-engineering leads to underperformance, so keep benchmarking and keep fine-tuning. That's where you'll understand that certain things can be done in an easier way rather than taking the long route. Benchmarking helps us tune and understand the right amount of work to do. I could sum this up as a kind of framework: tune your model, track your usage, trim the excess, test the trade-offs, and think sustainably.
Those are some of the things I could think of. I think efficiency is invisible until it's missing. That's to say we can make AI smarter, leaner, and greener together. So let's move forward to a future where AI is not just intelligent; it's efficient and ethical. Feel free to connect with me on LinkedIn or send me an email if you are working on similar problems or want to collaborate on AI sustainability. Thank you. Thank you so much. Does anybody have any questions? Okay. I don't see any questions in the chat. Let me just see if there is something else in the chat. No?
I hope you have learned some beneficial things today about looking beyond the financial cost, and that you have understood there is an ecological cost for everything we do; AI can have a carbon footprint, and that is what we have to take care of. Thank you. Thank you so much.