From Idea to Impact - Building GenAI Solutions by Narmatha Bala
Narmatha Bala, Software Engineering Manager
Kavitha Govindarajalu, Principal Software Engineer
Unlocking the Power of Generative AI: From Idea to Production
Welcome to our comprehensive guide on navigating the world of generative AI applications! In this blog post, we will explore key insights on how to transform innovative ideas into functional generative AI solutions. With the tech industry buzzing about AI advancements, understanding the journey from conception to production is vital for businesses looking to leverage this technology effectively.
Understanding Generative AI: The New Frontier
Generative AI has rapidly evolved into a powerful tool capable of revolutionizing various industries. However, its dual nature as both an overhyped and underappreciated technology complicates its integration into real-world applications. Why is this the case? Here's what you need to know:
- Rapid Development: Projects that once took months can now be developed in a matter of days or even weekends.
- Accessibility: Non-technical users can engage with AI tools like GPT to create functional prototypes.
- Probabilistic Nature: Generative AI operates on probabilities, meaning outputs can vary, raising challenges in reliability and consistency.
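One practical mitigation is to pin the decoding parameters so outputs vary as little as possible between runs. Here is a minimal sketch, assuming the OpenAI Python SDK; the model name is illustrative, and even a fixed seed only gives best-effort reproducibility:

```python
# Minimal sketch: taming output variance by pinning decoding parameters.
# Assumes the OpenAI Python SDK; the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user",
               "content": "Recommend a laptop for graphic design under $1000."}],
    temperature=0,        # near-greedy decoding: less run-to-run variation
    seed=42,              # best-effort reproducibility, not a guarantee
)
print(response.choices[0].message.content)
```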
Challenges in Transitioning from Prototype to Production
While creating a generative AI model is exciting, the real challenge lies in scaling it to a production-ready state. Here are some hurdles that organizations encounter:
- Scalability: Can your solution handle increased demand effectively?
- Quality Assurance: How can you ensure the AI model produces trustworthy outputs consistently over time?
- Continuous Monitoring: AI models require ongoing assessment to track their performance and adapt to new data.
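As a starting point for that ongoing assessment, it helps to log latency and token usage on every call so drift and cost regressions become visible over time. A minimal sketch, with illustrative function names, wrapping an OpenAI-style client:

```python
# Minimal monitoring sketch: log latency and token usage per call so drift
# and cost regressions show up in dashboards. Names are illustrative.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)

def monitored_call(client, **kwargs):
    start = time.time()
    response = client.chat.completions.create(**kwargs)
    latency = time.time() - start
    usage = response.usage  # token counts reported by the API
    logging.info(json.dumps({
        "model": kwargs.get("model"),
        "latency_s": round(latency, 3),
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
    }))
    return response
```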
Starting Your Generative AI Journey
Before diving into technical complexities, it’s essential to define the problem your AI solution aims to solve. Here are pivotal steps to consider:
1. Define the Problem Statement
Every successful AI project starts with a clear problem statement. Ask yourself:
- Is AI the right solution? For simple tasks, traditional programming might suffice.
- Does my problem require understanding human complexities? If so, generative AI might be the best route.
2. Build a Prototype
During the prototyping stage, remember these key principles:
- Keep it Simple: Avoid over-engineering in the early phases.
- Focus on Learning: Test the AI’s ability to understand user intent and respond appropriately.
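To make that concrete, a scrappy first prototype can simply ask the model to extract structured intent from a free-form request and let you inspect the result by eye. A sketch, again assuming the OpenAI Python SDK; the prompt wording and JSON field names are illustrative:

```python
# Prototype sketch: can the model pull structured intent out of a free-form
# request? Prompt wording and JSON field names are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def extract_intent(user_message: str) -> dict:
    prompt = (
        "Extract the user's intent from the message below as JSON with keys "
        '"product_category", "budget_usd", and "key_requirements".\n\n'
        f"Message: {user_message}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(response.choices[0].message.content)

print(extract_intent("What's a good laptop for graphic design under $1000?"))
```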
3. Establish Success Criteria
Defining what “success” looks like is crucial. Instead of vague metrics like "make it smarter," aim for specific performance indicators:
- User Satisfaction: Aim for at least 80% of users to find recommendations helpful.
- Response Accuracy: Set targets for accuracy and relevance in user interactions.
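Once the targets are specific, they can be computed directly from feedback logs. A minimal sketch, assuming a simple list of feedback records; the schema is illustrative, and the targets are taken from the examples in this post:

```python
# Minimal sketch: computing the success criteria above from feedback logs.
# The record schema and targets are illustrative.
feedback = [
    {"helpful": True,  "within_budget": True},
    {"helpful": False, "within_budget": True},
    {"helpful": True,  "within_budget": False},
    {"helpful": True,  "within_budget": True},
]

helpful_rate = sum(r["helpful"] for r in feedback) / len(feedback)
budget_rate = sum(r["within_budget"] for r in feedback) / len(feedback)

print(f"Helpful recommendations: {helpful_rate:.0%} (target: 80%)")
print(f"Budget respected:        {budget_rate:.0%} (target: 90%)")
```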
Building Responsibly: Ensuring Ethical AI Deployment
As you prepare for deployment, it is vital to embrace the principles of Responsible AI. Here are key considerations:
1. Fairness
Ensure that your AI does not unintentionally introduce biases into its outputs. Be proactive in identifying potential sources of bias in your training data.
2. Safety and Reliability
Design your AI system to fail safely. Establish fallback protocols and rigorous testing to manage unexpected behaviors.
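In code, "failing safely" can start as small as a retry wrapper with a safe default answer. A minimal sketch under those assumptions; the error handling, limits, and fallback message are illustrative:

```python
# Minimal "fail safely" sketch: retry with backoff, then degrade to a safe
# fallback answer instead of crashing. Limits and messages are illustrative.
import time

FALLBACK = "Sorry, I can't answer that right now. A human teammate will follow up."

def safe_generate(call_model, prompt: str, retries: int = 2) -> str:
    for attempt in range(retries + 1):
        try:
            return call_model(prompt)      # your model call, e.g. a monitored wrapper
        except Exception:                  # rate limits, timeouts, API errors
            if attempt < retries:
                time.sleep(2 ** attempt)   # simple exponential backoff
    return FALLBACK                        # never surface a raw stack trace to users
```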
3. Privacy and Security
Protect user data and maintain strict privacy standards throughout your AI’s lifecycle.
4. Inclusiveness
Make sure that diverse user groups can effectively interact with your AI system.
5. Transparency and Accountability
Be ready to explain your model's behavior and implement changes should issues arise.
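One building block for accountability is an append-only audit trail that stores enough context to explain any output after the fact. A minimal sketch; the storage format and fields are illustrative, and a real system should consider redacting PII before persisting prompts:

```python
# Minimal accountability sketch: persist enough context with every response
# to explain it later. JSONL storage and field names are illustrative.
import datetime
import json
import uuid

def audit_log(path: str, prompt: str, response_text: str, model: str) -> str:
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,           # which model/version produced this output
        "prompt": prompt,         # consider redacting PII before persisting
        "response": response_text,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # append-only JSONL audit trail
    return record["id"]
```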
Iterate and Evolve: The Continuous Improvement Cycle
As you move towards production, remember that building a generative AI application is not a one-and-done event. Engage with real users frequently to gather feedback:
- Prototype Testing: Release your AI to a small group of early adopters.
- Feedback Loops: Implement mechanisms for users to report helpful and unhelpful responses; see the sketch below.
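A feedback hook can be as small as recording which ranked recommendation a user picked and whether they found it helpful; if users consistently skip the top results, the ranking needs work. A minimal sketch with an illustrative schema:

```python
# Minimal feedback-loop sketch: record each verdict plus which ranked item
# the user picked, so positional patterns become visible. Schema is illustrative.
from collections import Counter

def record_feedback(log: list, session_id: str, picked_rank: int, helpful: bool) -> None:
    log.append({"session": session_id, "picked_rank": picked_rank, "helpful": helpful})

def rank_histogram(log: list) -> Counter:
    # If ranks 3-5 dominate, the top recommendations are likely irrelevant.
    return Counter(r["picked_rank"] for r in log)

log = []
record_feedback(log, "s1", picked_rank=3, helpful=True)
record_feedback(log, "s2", picked_rank=4, helpful=False)
print(rank_histogram(log))
```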
Video Transcription
Welcome to this talk. Today, both of us are going to chat a little bit about generative AI applications and what it takes to build them. My name is Narmatha, and I'm an engineering manager at Microsoft working on the Industry Solutions Engineering team. Our teams work very closely with customers on real-world problems, and over the last couple of years, we've been helping our customers solve some of their biggest problems using generative AI. With that, I'll pass this on to Kavitha.
Thank you, Narmatha. I'm also an engineering manager at Microsoft. Like Narmatha mentioned, we work with customers on GenAI solutions and help them accelerate their GenAI journeys. I'm also a mom of two teenage boys and a robotics coach. Super excited to be here to talk about taking GenAI from idea to production with you all. Let's jump right in. Have you ever noticed how every single app these days is suddenly AI-powered? GenAI has become the flashy new thing for sure. Right? But let's get real for a second: it is the most overhyped and underhyped technology at exactly the same time. Now what do I mean by that? Overhyped, because people trust it like some digital oracle, more than they're supposed to. And underhyped, because it's quietly rewriting every industry and transforming how we all work while we're busy arguing about whether it's going to take over our jobs.
Now here's what actually blows my mind. An app that would have taken an entire team six months to build can now be built over a weekend. Product managers, even nontechnical people, are talking to AI tools like GPT and Claude and building these technical solutions. All of this is called vibe coding. Right? People tell the AI what to build, and it figures out all the technical pieces and magically builds the basics. So, Narmatha, what do you think about all that?
Thank you, Kavitha. So today, Kavitha and I are both going to share some of our experiences working on real-world problems. Going back to vibe coding, let's talk about what happens after the weekend hackathon is complete. Right? Building a cool demo is perhaps the first milestone of the marathon. The reality is that taking an app from prototype to production, not sometimes but all the time, is where most of these ideas stall. What does that mean? Let's break it down. You start with an idea. Let's say you want to build a chatbot, or perhaps a document summarizer, or maybe even a smart assistant. You did the vibe coding over the weekend, and it works great. But then you move it into the real world, and that's when the questions start: can it really scale? Can it handle edge cases? And what if something really goes wrong?
Or even things like: I got an answer yesterday that I really liked, but Monday morning when I run it, it's not exactly the same as it was over the weekend. What happened? These are not small details; these are the differences between trust and chaos. GenAI systems behave very differently from traditional software. GenAI systems are probabilistic. What does that mean? The input you give today might produce a slightly different output tomorrow, because these LLMs are nondeterministic. So testing, monitoring for changes, and even figuring out how you know an answer is correct become new challenges. And this is where evaluation is not just a QA checkbox; it really needs to be an ongoing design philosophy.
And then let's talk about resilience. What happens when a model update lands overnight and the behavior shifts slightly in ways you weren't expecting? When you have a system in production, it needs safeguards in place. What do the fallbacks look like? How do I know things aren't behaving the way they did last week? How do I know these systems aren't getting rate limited? How do I even monitor how big my token counts are? Those are real-world problems. And let's not forget about privacy and compliance, the drift that can come with model updates, and the biases that exist. So while vibe coding is a great start, it does lower the barrier to entry.
But building something that you can depend on takes a lot more: engineering rigor, grounding for the model, human context, and, frankly, teams who know how to ship things responsibly. We'll talk a lot more about what responsible AI means towards the end, but we're at a point where the creative ceiling has lifted and there's a lot of opportunity ahead. With that, Kavitha, I would love to hear from you on where people should start.
Absolutely. Well said, Narmatha. So where do people start? We always start traditional software with a problem statement. Right? We miss that point sometimes; we're so happy to build that flashy demo and show it to our executives, like, this is the new thing. But the problem statement is the most important step, so it's important to talk about it. The first question we should all be asking is whether we should use AI for this particular problem at all, and how. If you're building a simple autocomplete or a simple rule-based decision, a traditional solution is probably much faster, cheaper, and more reliable. Right? GenAI shines when a problem involves human complexity: understanding nuance, handling ambiguity, or processing natural language.
So what does a good GenAI problem look like? Remember the last time you tried to find a product online? You probably spent half an hour digging through filters and comparing specifications, and at the end of it, you weren't even sure you made the right choice, but you spent an hour looking. So when a customer asks, "What's a good laptop for graphic design under a thousand dollars?", they are sharing their profession, implying a need for color accuracy and processing power, suggesting a budget constraint, and asking for a value judgment on what "good" means in this context. This is precisely where and why we need AI, specifically GenAI. Right?
It bridges the gap between how humans naturally communicate and how computers process information. So to summarize: we need to calculate the true business impact, not just technical feasibility. How do you do that? The ROI check I would run is: am I making something marginally better at a significant cost, or is this going to really improve the customer experience and produce real results? What are we gaining? I think that's the basic question we need to start with: the problem being solved.
Thank you, Kavitha. That clearly defines why we need GenAI, rather than just chasing the next shiny thing to build. Now let's talk about what happens once the idea is on the table. For instance, you come up with a solid use case. Let's say an AI assistant that provides product recommendations based on conversational input. That seems like a no-brainer. Right? That's a great idea and a wonderful use case to start with. And then you're thinking, let's go ahead and build that prototype. Well, here's the first rule of GenAI prototyping: do not over-engineer in the beginning. Keep it small. It is okay for it to be scrappy, and let the user experience shape how the system needs to evolve, not the other way around.
Because at this stage, the goal must be to learn fast. Can the LLM understand the user's intent? Can it respond with something useful, or is it just going to spew out stuff that's totally irrelevant and not what the user is looking for? As you start building systems around these questions, always engage with real users, because we're building applications for real-world problems. The sooner you bring real users into pilot testing, the sooner you get early feedback and start to understand not just what to expect from the system, but also where things can possibly go wrong. Right? So how do you approach all of this? We talked about LLMs being probabilistic, and we also talked about them being nondeterministic. With all of these constraints, how do we still check that the system is doing the right thing?
So how do you start thinking about something like this? The very first thing is to think about what good looks like. What are your success criteria? Not something vague like "make it smarter" or "make a better recommendation system." Right? Be specific; sometimes being specific saves a lot of pain further down the line. So in our recommendation system example, you might aim for something like: 80% of users say the recommendation was helpful, or it gives responses that respect the price constraint roughly 90% of the time. And maybe this is also the time to identify your minimum bar. Let's say it gives users a good recommendation only 50% of the time; is it still helpful?
And what needs to be done to get to that end goal of 80%? These kinds of goals tell you whether you're on the right track, because you're measuring them. Because you know what good looks like, you start measuring, and when you start measuring, it's not measure once and be done; you take an iterative approach of measuring early and measuring often. This is exactly how GenAI differs from traditional software development: you need a continuous indication of how you're making progress. So let's take a step back. We have a good problem to solve. We started building a prototype. We know what good looks like.
We know how to measure it because we've identified what the success measures look like, and you, quote unquote, release it to early users. Right? For these early adopters, identify a small group of folks who know these responses are generated by AI, meaning they know for sure this isn't a finished product. Then start building feedback loops into the system, a really good mechanism so users can flag the helpful responses and, more importantly, the unhelpful ones. Try to track where in the workflow there are moments of confusion. Let's say the recommendation system offers up five recommendations for a product, and users always pick only the third or the fourth or the fifth. It's likely that the top two are not relevant.
So that would be a good signal to start digging into the system to see what the reason might be. And don't forget: human feedback is always the best input. If you're building for users, getting early feedback from the humans who are going to use this app is the best thing you can do, so try to see how early and how often you can bring them in. Talk to them. Learn why it's working for them and why it's not. Because the point of building a production-facing application is to make sure we are grounded in real-world context. So prototype, iterate, understand what's worth measuring, measure what matters, and learn in tight loops. That's how you build an evaluation system for GenAI.
But here's the bigger question: when do you decide whether to ship it? That's where I'd love to hear Kavitha talk a little more about what's critical to know as we're building a GenAI system. Is it fair? Is it safe? When we push things into the real world, you don't want to be in the papers for the wrong reasons. It's about having the confidence that, as a company, you can stand behind the product in the real world.
Yeah. So we have covered how to build and evaluate GenAI systems. But before we wrap, let's pause and talk about something just as important: how to build responsibly. Responsible AI isn't just a Microsoft value. It's a practical framework for making sure that what we ship is fair, safe, and trustworthy. Let's take a look at these principles. Fairness: are we unintentionally biasing outcomes based on how the model was trained? Reliability and safety: what happens when the model is wrong? Does it fail safely? Privacy and security: are we protecting our users' inputs and sensitive data? Inclusiveness: can people of different backgrounds, languages, or abilities use the system effectively? And underpinning all of that, transparency and accountability. Right? Can we explain what the model did, and are we prepared to take action if something goes wrong?
If we can explain all of that, that's when we can be confident we're shipping our product responsibly. To do that, you need a solid monitoring system. Right? If you're going to explain exactly what the model did, you have to be monitoring all of it, and like Narmatha mentioned, monitoring feeds directly into how we measure. The second thing, to go back to your point, Narmatha, is that qualitative feedback and the human in the loop become super critical here. Even if somebody is building a completely automated, agentic system, a human in the loop for evaluation and qualitative feedback is essential. You can track engagement and accuracy with all the quantitative metrics, but if someone says, "I don't trust that response," or "that felt off," that's a signal.
That's how you catch what the quantitative metrics miss and say, no, we need to go back and fix this to make sure we ship responsibly. Right? So responsible AI isn't a final step. It's how we design, how we test, and how we respond when things don't go as expected. It's integrated into every single step we take. Because shipping GenAI is not just about making it work; it's about making it right.
Absolutely. And shipping responsibly and making it right are tall promises. The reality is, GenAI is not magic. Right? It can feel magical when it's done right, but a lot of thought needs to go on behind the scenes: how we define problems, what kinds of experiments we run, how we know we're doing the right thing, what our success measures are, and how we make sure we ship responsibly. A call-out to Microsoft's Responsible AI framework: it gives a really good structure for thinking about the key pillars, what to look at, and how to go about it. There are great examples in the framework, and I'm happy to post a link right when we're done.
Also, we're riding a wave here, and it's moving fast. It's funny, I was joking about this last week: you blink, and it's not like you missed an episode, it's like you missed a whole season. Right? You're like, wait, where was I the last couple of days? There are so many moving pieces. So one thought I'll leave you all with: ask the hard questions, measure what matters, and don't ship until it's something you'd trust with your name on it. With that, I'm sure you'll be able to go on to build amazing GenAI applications in the real world. And with that, we come to a close, and we are on time. Happy to take maybe a question or two. I do see one posted.
As GenAI moves into production, how do you ensure it creates real societal or social impact, especially for underrepresented communities? I can take a stab at that. One of the things I would go back to is building responsible AI systems. If there is a way the system could go wrong, it definitely will go wrong. So testing, anticipating what could go wrong, making sure the model responds the way you would expect, and making sure it does not respond in ways you wouldn't expect, will be key. And, again, it's not a one-time deal; you have to do this as a periodic, iterative process. I do encourage you to take a look at the Microsoft RAI principles framework. We'll be posting that in a minute.
So, hopefully, that answers some of your questions, Alejandra. To add to that: do you have representative end users as well as representative input data? Right? That's very important if you're really thinking about that question. Do we have a representative sample? That would be a good question to ask or start with. Absolutely.
Awesome. It was wonderful being here with you all today. I hope you enjoy the rest of the day listening to some inspiring conversations.
Thank you all.
Thank you, everyone. Bye.
Bye.