Building Gen AI workloads with AWS Serverless compute by Sowjanya Pandruju
Sowjanya Pandruju
Cloud Native Applications Architect
Building Gen AI Workloads on Serverless Compute: An Overview
Welcome to our in-depth exploration of building Generative AI (Gen AI) workloads on serverless compute, inspired by a recent presentation from Sowjanya Pandruju, a Cloud Application Architect at AWS, during the Women's Tech Global Conference. This article aims to shed light on the synergy between Gen AI and serverless computing, outlining various use cases and architectural patterns to enhance your understanding.
Understanding the Gen AI Ecosystem
Generative AI is a branch of artificial intelligence capable of creating new content—ranging from text and images to music and videos. With its growing prominence, companies can leverage Gen AI in several ways:
- Customer Experience: Virtual assistants, chatbots, and intelligent contact centers can transform customer interactions.
- Productivity Improvement: Capabilities such as conversational search and code generation can streamline operations.
- Business Operations Enhancement: Intelligent document processing and quality control can optimize backend operations.
At the core of Gen AI are foundation models—pretrained, large-scale machine learning models that can be fine-tuned for specific applications, often requiring less data and computational resources than traditional methods.
Key Personas in the Gen AI Ecosystem
During the discussion, three primary personas within the Gen AI ecosystem were highlighted:
- Model Consumers: These users prefer off-the-shelf AI products and focus on integration with existing workflows without heavy infrastructure management.
- Model Tuners: These businesses fine-tune foundation models for specific industry applications.
- Model Builders/Providers: Companies that develop their own models from scratch for internal teams (builders) or make their models available to other customers as a service (providers).
Why Utilize Serverless Compute for Gen AI?
Choosing a serverless architecture for Gen AI workloads offers significant advantages:
- Accelerated Development: Developers can focus more on innovation rather than managing infrastructure.
- Cost-Effectiveness: Serverless pricing is based on actual usage, eliminating costs associated with idle server time.
- Built-in High Availability: Serverless solutions automatically provide fault tolerance and scalability, which is crucial for unpredictable ML workloads.
Amazon Services for Gen AI
Key serverless services available on AWS for developing Gen AI applications include:
- Amazon SageMaker JumpStart: A managed service for deploying pretrained foundation models, where you handle the deployment, configuration, and hosting of the model within your own application architecture.
- Amazon Bedrock: A fully serverless option that lets you invoke models through a simple API call, with no model deployment or infrastructure to manage (a minimal sketch follows below).
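To make this concrete, here is a minimal sketch of invoking a foundation model through Amazon Bedrock with the AWS SDK for Python (boto3). The model ID is just an example, and the request/response format varies by model provider:

```python
# A minimal sketch: invoke a foundation model through Amazon Bedrock.
# The model ID is an example; request/response formats differ per provider.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": "Summarize the benefits of serverless."}
        ],
    }),
)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```

Because the call is a plain API request, there is no endpoint to provision or scale; that is the "fully serverless" property in practice.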
Emerging Patterns in Gen AI Applications
Below are some noteworthy use cases that illustrate how serverless architecture supports Gen AI workloads:
1. Retrieval Augmented Generation (RAG)
RAG enhances AI responses by retrieving relevant data from outside the foundation model and injecting it into the prompt's context. This approach is effective in enterprise settings such as financial auditing: using the Kendra chatbot solution, analysts can query financial documents easily, with relevant answers provided alongside source links for transparency.
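As a rough illustration of the flow, the sketch below retrieves passages from a Kendra index and injects them into the prompt sent to an LLM hosted on a SageMaker endpoint. The index ID, endpoint name, and payload format are placeholders, not values from the original solution:

```python
# A simplified RAG sketch: retrieve context from Amazon Kendra, then query
# an LLM on a SageMaker endpoint with the retrieved passages in the prompt.
import json
import boto3

kendra = boto3.client("kendra")
sagemaker = boto3.client("sagemaker-runtime")

def answer(question: str) -> str:
    # Pull the most relevant passages for the user's question.
    hits = kendra.retrieve(IndexId="YOUR_KENDRA_INDEX_ID", QueryText=question)
    context = "\n".join(r["Content"] for r in hits["ResultItems"][:3])

    # Augment the prompt with retrieved data to keep answers in domain.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    response = sagemaker.invoke_endpoint(
        EndpointName="YOUR_LLM_ENDPOINT",      # placeholder endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),   # payload schema depends on the model
    )
    return json.loads(response["Body"].read())["generated_text"]  # assumed field
```

In the presented architecture this orchestration lives in a Lambda function using LangChain, with Amazon Lex as the conversational front end.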
2. Document Summarization
Utilizing large language models (LLMs) for document summarization can greatly reduce processing time. With an event-driven architecture, users can upload lengthy documents to a storage solution, triggering automated text extraction and summarization workflows.
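A minimal sketch of the entry point, assuming the S3 upload triggers a Lambda function that starts an asynchronous Textract job (the SNS topic and IAM role ARNs are placeholders):

```python
# Hedged sketch: a Lambda function triggered by an S3 upload starts an
# asynchronous Amazon Textract job; completion is signaled via SNS.
import boto3

textract = boto3.client("textract")

def handler(event, context):
    # The S3 put event carries the bucket and key of the uploaded PDF.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        NotificationChannel={  # Textract notifies completion via SNS
            "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:textract-done",  # placeholder
            "RoleArn": "arn:aws:iam::123456789012:role/textract-sns",           # placeholder
        },
    )
    return {"jobId": job["JobId"], "document": f"s3://{bucket}/{key}"}
```

Downstream, the summarization itself runs asynchronously (for example via an SQS queue and a Fargate task, as the transcript below describes), since long documents can exceed what a single Lambda invocation comfortably handles.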
3. Document Generation
Automating the creation of documents such as contracts and agreements can significantly improve efficiency and reduce human error. In the sample architecture, a Stable Diffusion model generates images to accompany the essential document text, extending what the workflow can produce.
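As a hedged sketch, generating one image might look like the following call to a SageMaker endpoint hosting a Stable Diffusion model; the endpoint name and the response field are assumptions, since JumpStart models differ in their exact payload schemas:

```python
# Hedged sketch: ask a Stable Diffusion model on a SageMaker endpoint for an
# image. Endpoint name and response field are assumptions for illustration.
import base64
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def generate_image(prompt: str, out_path: str = "generated.png") -> str:
    response = runtime.invoke_endpoint(
        EndpointName="YOUR_SD_ENDPOINT",       # placeholder JumpStart endpoint
        ContentType="application/json",
        Body=json.dumps({"prompt": prompt}),   # payload varies by model version
    )
    payload = json.loads(response["Body"].read())
    image_b64 = payload["generated_image"]     # assumed response field name
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(image_b64))
    return out_path
```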
4. Safe Image Generation
As image generation technology evolves, it's imperative to incorporate content moderation. Fine-tuned models, paired with moderation checks both before and after generation, help ensure that generated content aligns with community standards and keeps the user experience safe.
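A minimal sketch of the moderation gate, assuming text is screened with Amazon Comprehend and images with Amazon Rekognition before results are returned (the threshold and the sentiment rule are simplified stand-ins for the rule-based checks described in the talk):

```python
# Hedged sketch of a sequential moderation gate around image generation:
# screen the text prompt, then screen the generated image.
import boto3

comprehend = boto3.client("comprehend")
rekognition = boto3.client("rekognition")

def prompt_is_safe(text: str) -> bool:
    # Simplified rule: reject strongly negative prompts. A real system would
    # combine rule-based checks with Comprehend's classifiers.
    result = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    return result["Sentiment"] != "NEGATIVE"

def image_is_safe(image_bytes: bytes) -> bool:
    # Rekognition returns moderation labels above the confidence threshold;
    # an empty list means nothing objectionable was detected.
    result = rekognition.detect_moderation_labels(
        Image={"Bytes": image_bytes}, MinConfidence=80
    )
    return not result["ModerationLabels"]
```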
5. Intelligent Document Processing (IDP)
IDP streamlines document workflows in three stages: classification, extraction, and enrichment. Applying foundation models at each stage reduces errors, improves data reliability, and enhances organizational productivity.
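As one hedged sketch of the classification and extraction stages, the code below pulls text from a single-page document with Amazon Textract and asks a Bedrock-hosted model to label it; multi-page PDFs would use Textract's asynchronous APIs instead, and the model ID and prompt are illustrative:

```python
# Hedged IDP sketch: extract text with Textract, then classify it with a
# foundation model on Bedrock. Works for single-page documents.
import json
import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")

def classify_document(bucket: str, key: str) -> str:
    # Extraction: pull lines of text (and form fields) from the document.
    extracted = textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    text = " ".join(
        block["Text"] for block in extracted["Blocks"]
        if block["BlockType"] == "LINE"
    )

    # Classification: a foundation model labels the document type with no
    # task-specific training.
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 20,
            "messages": [{
                "role": "user",
                "content": "Classify this document as invoice, contract, or "
                           "report. Reply with one word.\n\n" + text[:4000],
            }],
        }),
    )
    body = json.loads(response["body"].read())
    return body["content"][0]["text"].strip()
```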
6. Automated Caption Creation
Creating textual descriptions for images can improve searchability and user experience. Using the Kendra search engine, users can perform natural language queries to retrieve specific images based on contextually relevant descriptions.
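Once these descriptions are indexed, the search side reduces to a Kendra query; a small sketch, with the index ID as a placeholder:

```python
# Hedged sketch: natural-language image search against a Kendra index whose
# documents carry generated captions.
import boto3

kendra = boto3.client("kendra")

def find_images(phrase: str):
    result = kendra.query(IndexId="YOUR_KENDRA_INDEX_ID", QueryText=phrase)
    # Each hit carries the caption excerpt and a URI back to the image.
    return [
        (item["DocumentTitle"]["Text"], item["DocumentURI"])
        for item in result["ResultItems"]
        if "DocumentURI" in item
    ]

# Example: find_images("dogs playing in the park")
```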
Conclusion
Leveraging serverless compute lets developers build, deploy, and host foundation-model-powered applications without managing infrastructure, combining pay-per-use pricing and built-in availability with the rapid pace of Gen AI innovation. For model consumers and tuners alike, this pairing can turn weeks of undifferentiated setup into a much faster path to production.
Video Transcription
Okay, it's time. So, hi, everyone. I'm Sowjanya Pandruju, a cloud application architect at AWS, and I'm very excited to be here today for the Women's Tech Global Conference. And today I'm here to talk about building Gen AI workloads on serverless compute. So let's get started. First things first, I wanna talk about our agenda. So here's our agenda. We'll first cover what the Gen AI ecosystem is about and what the synergy with serverless compute is, followed by potential use cases. So let's dive in. The Gen AI ecosystem. So let's talk about this. Gen AI, as we know, is a type of AI that can create new content and ideas, including conversations, stories, images, videos, music, whatnot. So we are saying you can use Gen AI for a wide range of use cases.
Like, you can use Gen AI to improve customer experience through capabilities like virtual assistants, chatbots, and intelligent contact centers. You can have content moderation. You can also boost your employees' productivity if you are a large company. With Gen AI-powered search, you can improve conversational search. You can do text summarization. You can do code generation. And you can also use Gen AI to turbocharge production of all types of creative content like art, music, images, video, animations, all of that. Finally, you can also use Gen AI to improve business operations. Like, you can have intelligent document processing. You can have quality control and visual inspection. You can also generate synthetic training data for your new business use cases. So as you can see, it has a wide range of use cases. So, like all AI, Gen AI is powered by machine learning models.
These are very large models that are pretrained on vast amounts of data and commonly referred to as foundation models. And we'll talk more about these in the upcoming content as well. So how do foundation models differ from other machine learning models? And we'll also talk about the difference between Gen AI and traditional AI. In the side-by-side comparison, I want to highlight that the big difference we see between Gen AI and traditional AI is that traditional models can only describe or predict something based on existing content, but Gen AI has the capability to create new content. With typical machine learning models, you gather labeled data, train a model, and deploy that model. But with foundation models, instead of gathering labeled data and training multiple models, you use the same pretrained foundation model to adapt to several tasks.
It can also be customized to perform domain-specific functions that are differentiating to a business, using only a very small fraction of the data and compute required to train a model from scratch. So it's a really big deal. Right? So in the Gen AI ecosystem, there are three types of personas, and our conversation today is going to focus more on certain of these personas. The three types we are discussing here are the consumers, tuners, and builders or providers. So what does this mean? Model consumers prefer to buy AI products off the shelf and want to integrate Gen AI capabilities into their applications and workflows as simply, accurately, and cost effectively as possible. They are looking for the easiest way to build and scale Gen AI applications.
These customers want to accelerate the development of their Gen AI applications using pretrained foundation models, without having to manage any infrastructure. So that's how we define the consumer persona. And now come the model tuners. These tuners will fine-tune or retrain foundation models for specific use cases. These companies are typically SaaS businesses with models trained for the particular industry segment they serve. And finally, model builders. Builders are companies training their own large model from scratch with the intention of only making that model available to their own internal teams. And model providers are companies who want to make their models available to other customers. These companies are usually aiming to offer their foundation models to other organizations, like through APIs, through model-as-a-service hubs, or through direct distribution to customers and so on.
So for our conversation, we are going to focus mainly on the consumer and tuner personas here, and we are gonna look into the use cases based on that. So let's dive in. Now we are talking about Gen AI integration with serverless, because that's the core of the topic we are discussing today. For serverless, we will be focusing on, as I mentioned, the top two types of customers, the model consumer and the model tuner, for faster time to market with managed services, where model consumers use existing models and model tuners train existing models with domain data to generate a new model.
So when it comes to accessing foundation models, there are many services available with Amazon. Here I'm showing you the side-by-side comparison between Amazon SageMaker JumpStart and Bedrock. So what's the main difference between these two? These two are really amazing services in their own way. The biggest difference I want to highlight is that with SageMaker, you manage the deployment, configuration, and hosting of the model in your application architecture. Bedrock is serverless: you just invoke the model API with parameters, and you don't have to manage the model configuration or deployment. And each model that you get to pick has different pricing, depending on the request tokens and so on.
So now the big question of the whole talk: why build Gen AI on AWS serverless compute? Like, what are we achieving here? I wanna highlight that the speed of serverless compute services, along with the power of Gen AI, can expedite innovation, and that's what we are trying to achieve here. So we want to address the question of why serverless compute for Gen AI. Right? With serverless compute, accelerated development allows customers to focus on Gen AI applications, with built-in high availability, auto scaling, and fault tolerance. All of it is built in. It simplifies building and management. And because of the latency-sensitive and very unpredictable nature of ML workloads, using serverless compute alleviates teams from the complexity of infrastructure planning and infrastructure management, so that they can focus on building the best applications to support their businesses.
And AWS provides the widest range of purpose-built serverless services, from compute and storage to workflows, streaming, and analytics, that you can rapidly integrate and compose into serverless applications, accelerating your time to production. It has the most choices for customers to pick the right tool for the right job and compose them quickly into production-grade applications. It's also cost effective. One of the major advantages of using serverless architecture to train ML models is the pricing structure. In the traditional ML approach, the server is kept on even when the model is not utilized. But when serverless architecture is used, it becomes easy to reduce cost, as the cost model of serverless is execution based. Finally, Gen AI is in its nascent stage with models evolving rapidly, and we have to provide a way to lower the barrier to entry with a plug-and-play model, which customers can achieve with a serverless compute architecture.
So with this, I hope we've laid a foundation on why serverless compute for Gen AI. And there are a lot of patterns in terms of application architectures that are enabled by serverless. You can support microservices, data architectures, integration patterns, and all of this. It supports a wide range. Now that we've covered the Gen AI ecosystem and the question of why serverless for Gen AI workloads, I want to dive into some emerging Gen AI patterns. These are some of the most used or most in-demand patterns, and I want to dive into the architectures through those patterns. So now let's take a look. These are the use cases I'm going to talk about today in detail: how the serverless architectures come into the picture, the details of the services we are using, and so on.
So first, a high-level understanding. The first use case I'm bringing here is retrieval augmented generation, most commonly known as RAG. It is a technique to retrieve data from outside a foundation model to augment the prompts by injecting relevant retrieved data into the context. The second use case we're talking about is document summarization. Large language models, most commonly called LLMs, can be used to analyze complex documents and provide summaries and answers to questions. The third use case we're gonna talk about is document generation. Document generation plays a crucial role in streamlining business operations and enhancing productivity, whether you need to create contracts or agreements, invoices, or any other important documents of that nature.
So you can harness the power of a document generation tool, and it can revolutionize your workflow. Pretty great use cases, and that's the reason they are emerging patterns right now as well. So, we talked about the two types of personas we're covering today, right, the model consumer and the model tuner. For each persona, I have three different use cases that we have highlighted as emerging patterns, and we're gonna dive into the architecture and how it looks with AWS services, especially how the whole serverless architecture binds into the context. So first things first. The first use case we have is retrieval augmented generation, RAG. Here, I'm showing you a sample of how that looks. For enterprise use cases, the insights must be generated based on enterprise content to keep the answers in domain and to mitigate hallucinations, using the RAG approach.
In this application, we will be using the Kendra chatbot solution so that financial analysts and auditors can interact with their enterprise data to find reliable answers to audit-related questions. The Kendra chatbot provides answers along with source links and has the capability to summarize long answers. When you provide a source link with your answer, it creates transparency, and it earns trust in the answers that you get from the AI. So pretty powerful. Right? So let's dive into the architecture. Like, how are we implementing this? In this pattern that I brought here today, this is the flow we are going to look into. The financial documents that we talked about, and any agreements of that sort, are all stored on Amazon S3 and ingested into an Amazon Kendra index using the S3 data source connector.
The LLM is hosted on a SageMaker endpoint, and an Amazon Lex chatbot is used to interact with the user via the Lex Web UI. This solution uses an AWS Lambda function with LangChain to orchestrate between Kendra, Lex, and the LLM. When the user asks the Amazon Lex chatbot for answers, let's say from a financial document, Amazon Lex calls the LangChain orchestrator to fulfill that request. Based on the query, the LangChain orchestrator pulls the relevant financial records and paragraphs from Kendra. The LangChain orchestrator provides these relevant records to the LLM along with the query and a relevant prompt to carry out the required activity. The LLM processes the request from the LangChain orchestrator and returns the result.
And finally, the LangChain orchestrator gets the result from the LLM and sends it to the end user through the Lex chatbot that we have. So this is how the whole flow looks. That was our first use case. Now let's dive into the second use case we're talking about, document summarization. This pattern demonstrates how you can construct a real-time user interface to let business users process, let's say, a PDF document of arbitrary length. We have been using financial statements as an example, right? So let's go with that. Financial statements, like quarterly earnings reports or annual reports to shareholders, can be very long, often tens or hundreds of pages.
And these documents often contain a lot of boilerplate language, like legal language, disclaimers, and such. If you want to extract key data points from one of these documents, you need both time and some familiarity with the boilerplate language so you can identify the interesting facts. So here is the event-driven architecture we have come up with for this kind of use case. The front-end application lets users upload PDF documents to Amazon S3. After the upload is complete, you can trigger a text extraction job powered by Amazon Textract. As part of the processing, an AWS Lambda function inserts special markers into the text indicating, for example, page boundaries. When the job is done, you can invoke an API that summarizes the text or answers questions about it.
Because some of these steps may take some time, the architecture uses a decoupled, asynchronous approach. That's why we call it an event-driven approach. For example, the call to summarize a document invokes a Lambda function that posts a message to an Amazon SQS queue. Another Lambda function picks up that message and starts an ECS Fargate task. The Fargate task calls the SageMaker endpoint. Here we used a Fargate task instead of a Lambda function because summarizing a very long PDF may take more time and memory than a Lambda function can support. When the summarization is done, the front-end application can pick up the results from the DynamoDB table. So this is how the whole architecture binds together into a solution. And the third use case we have is the document generation that we talked about.
And here we are talking about image generation as a sample. So let's quickly take a look. Here we are using a Stable Diffusion model for the image generation, and the web application is built on Streamlit, an open source Python library that makes it easy to create and share beautiful, custom web apps for machine learning. We host this web application using ECS with Fargate, and it is accessed via an ALB, the Application Load Balancer. The Gen AI model endpoints are launched from JumpStart images stored in Amazon ECR, the Elastic Container Registry. The model data is stored on S3 in the JumpStart account.
So the web application interacts with the models via Amazon API Gateway and Lambda functions, as we can see in the architecture here. Now, model tuner use cases. Until now, we have been talking about model consumer use cases. Let's dive into the model tuner use cases. The first one is safe image generation. As we talked about, right, Gen AI technology is improving rapidly, and it's now possible to generate text and images based on text input. Customers using these models for image generation must prioritize content moderation to protect their users, their platform, and their brand, by implementing strong moderation practices so that they create a safe and positive user experience while safeguarding the platform and brand reputation.
That's why fine-tuning is a common technique used to adapt pretrained models to specific tasks. In the case of Stable Diffusion, fine-tuning can be used to generate images that incorporate specific objects, styles, and characters. And as I said, content moderation is critical when training your Stable Diffusion model so that you prevent the creation of inappropriate or offensive images. So you use patterns like a sequential pattern to moderate the text and images. A rule-based function and Amazon Comprehend are called for text moderation, and we use Amazon Rekognition for image moderation both before and after invoking Stable Diffusion. An AWS Lambda function coordinates image generation and moderation using Comprehend and Rekognition.
The RESTful API, as you see, will return the generated image and the moderation warnings to the client if any unsafe information is detected through this process. Now let's look at the second use case in this tuner persona, which is intelligent document processing. This is a really amazing technique in its own way. Data classification, extraction, and analysis can be very challenging for organizations that deal with high volumes of documents. Gen AI complements Amazon Textract to help with automating document processing workflows. Features like normalizing key fields and summarizing input data support faster cycles for managing document processing workflows while also reducing the potential for errors. So intelligent document processing, or IDP, let's make it simple and call it IDP, comprises three stages, like we just talked about: classification, extraction, and enrichment.
In the classification stage, foundation models can now classify documents without any additional training. This means that documents can be categorized even if the model hasn't seen similar examples before. Foundation models in the extraction stage normalize date fields and verify addresses, phone numbers, and so on, while ensuring consistent formatting. And foundation models in the enrichment stage allow inference, logical reasoning, and summarization. So when you use foundation models in each of these IDP stages, your workflow will be more streamlined and the performance will improve. Serverless services help provide the mechanism to build a solution for IDP quickly.
Services such as Lambda functions, Step Functions, and Amazon EventBridge can help build the document processing pipeline with the integration of these foundation models. Generating these summaries using IDP with a serverless implementation at scale helps organizations get meaningful, concise, and presentable data in a cost-effective way. Now let's look at our third and final use case for the tuner persona, automated caption creation. A Gen AI model can be used to generate a textual description for the image that I have here. During document ingestion of the image, it can say something like "a dog laying on the ground under an umbrella." Users searching for terms like "dog" or "umbrella" will then be able to find the image, as shown in the screenshot. So let's talk about the solution.
In this solution, we are taking this series of steps. We first upload images to an image repository like an S3 bucket. The S3 bucket is then indexed by Kendra, a search engine that can be used to search structured and unstructured data. During indexing, the Gen AI model as well as Amazon Textract are invoked to generate the image metadata. You can then search for images using natural language queries such as "find images of red roses" or "show me pictures of dogs playing in the park." You can do that through the Kendra console, and there is an SDK available, or you can do it through API calls. These queries are processed by Amazon Kendra, which uses machine learning algorithms to understand the meaning behind the queries and retrieve relevant images from the indexed repository.
The search results are then presented to you along with their corresponding textual descriptions, allowing you to quickly and easily find the images that you're looking for. So this is how the whole architecture comes together. So, yeah, we have looked at multiple use cases for the consumer and tuner personas, and we established how serverless compute enables foundation models to complete these tasks, like defining instructions and orchestration, configuring the foundation model to access company data sources, and writing custom code to execute these steps through a series of API calls.
So finally, we can say that developers can leverage serverless compute to build, deploy, and host these foundation-model-powered applications. In a sense, all of these steps can generally take weeks, and serverless compute can definitely help accelerate the timeline. That's what we are going for here. So I think that's the end of my presentation.