Addressing machine learning issues with Responsible AI

Ruth Yakubu
Principal Cloud Advocate
Automatic Summary

Understanding and Implementing Responsible AI: A Comprehensive Guide

Introduction to Responsible AI

In a world where Artificial Intelligence (AI) is becoming increasingly ingrained in our daily lives, the concept of responsible AI is gaining paramount importance. AI systems are shaping decisions in healthcare, finance, transportation, and more, which is why it's crucial to ensure these systems are fair, inclusive, and transparent. This article, drawing insights from an expert in the field, Ruth Yakubu, will discuss what responsible AI involves, why it's necessary, and how developers can address potential issues within AI models.

What is Responsible AI?

AI, a driver of modern innovation, is evolving rapidly. Cutting-edge advancements like OpenAI's chatbots have sparked widespread discussion. Amid this excitement, it's essential that AI systems are developed responsibly. Responsible AI encompasses core principles to ensure AI solutions are:

  • Fair: making equitable decisions
  • Reliable: performing well under various scenarios
  • Private and Secure: respecting individual privacy rights
  • Inclusive: considering diverse populations, including those with disabilities
  • Transparent: providing clarity into how decisions are made
  • Accountable: holding those who build AI systems to a standard of responsibility

Why Responsible AI Matters

As AI's reach extends, public scrutiny grows alongside its capabilities. Issues like AI-generated deepfakes present new ethical risks. Furthermore, government regulations are evolving, with calls for stricter AI oversight. This leads to an increased demand for AI systems that are not just technically proficient but also ethically sound and socially responsible.

The Challenges of Responsible AI

Data scientists and AI developers face challenges such as:

  • Inadequate tooling to ensure fairness within AI models
  • Models acting as 'black boxes' with obscure decision-making processes
  • Difficulty in diagnosing model errors, especially when traditional mathematical metrics don't reveal the human impact

Debugging AI Models with Responsible AI Toolkits

Thankfully, open-source projects like Fairlearn and InterpretML have been creating solutions to help data scientists uncover issues in AI models that might contravene responsible AI principles. Microsoft, for example, has contributed by packaging these toolkits into a single accessible tool, the Responsible AI Dashboard.

How to Use Responsible AI Dashboard

This comprehensive toolkit allows for multifaceted analysis, including:

  • Error analysis: Identifying parts of a model with high error rates
  • Data exploration: Examining data distributions for representation biases
  • Fairness assessment: Assessing model decisions across different demographics
  • Model interpretation: Explaining AI decisions to users and stakeholders
  • Counterfactuals: Exploring how changes in input can alter model predictions
  • Causal analysis: Evaluating the effects of altering a feature without extensive A/B testing
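
As a rough sketch of how these analyses are wired together in the open-source tooling, the snippet below uses the `responsibleai` and `raiwidgets` packages. The model, data frames, and target column name (`readmitted`, echoing the case study below) are placeholders for your own artifacts; treat this as a minimal sketch rather than a verbatim recipe.

```python
# Minimal sketch: composing the Responsible AI Dashboard from its components.
# Assumes `model` is a fitted scikit-learn-style classifier and train_df /
# test_df are pandas DataFrames containing the target column "readmitted".
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

rai_insights = RAIInsights(
    model, train_df, test_df,
    target_column="readmitted",
    task_type="classification",
)

# Choose which analyses the dashboard should run.
rai_insights.explainer.add()        # model interpretation
rai_insights.error_analysis.add()   # locate high-error cohorts
rai_insights.counterfactual.add(    # what-if counterfactuals
    total_CFs=10, desired_class="opposite")
rai_insights.causal.add(            # causal analysis without A/B testing
    treatment_features=["num_medications"])

rai_insights.compute()                # evaluate the model
ResponsibleAIDashboard(rai_insights)  # launch the interactive dashboard
```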

Practical Application: A Case Study

In Ruth's demonstration, a responsible AI widget was used to debug a classification model built to predict hospital readmission for diabetic patients. The widget highlighted areas where the model underperformed, potentially indicating data imbalance or bias. It evaluated features such as age, gender, prior hospitalizations, number of medications, and race. It also offered counterfactuals to show the conditions under which the model would produce a different outcome, promoting transparency and aiding decision-making.

Get Started with Responsible AI

For those interested in applying these insights to their own models, numerous resources are available: the open-source example notebooks on GitHub, a demo workshop that uses Azure Machine Learning, and the Responsible AI Dashboard documentation referenced at the end of the session.

Conclusion

Responsible AI is not just a technical requirement but a moral imperative. As AI continues to revolutionize industries, it's our duty to ensure that the technology remains fair, accountable, and inclusive. By leveraging tools such as the Responsible AI Dashboard and embracing the principles of responsible AI, developers can create technology that respects and enhances our society.

For more detailed instructions on how to use the Responsible AI Dashboard, and to better understand how to integrate responsible AI practices into your data science workflow, be sure to explore the links provided and consider reaching out to experts like Ruth Yakubu for workshops and further guidance.


Video Transcription

Hi, everyone. My name is Ruth Yakubu, and thank you for joining this session on how to address machine learning issues with responsible AI. In this session we're going to go through a high-level understanding of what responsible AI is and why it's important, then some of the gaps that we data scientists and AI developers currently face when we have to train a model and make sure it's responsible. Lastly, I'll do a demo with a real-life model and show you how to go about debugging a model to make sure it's more responsible, covering all the different areas within responsible AI. So let's get started. As you're all aware, AI is infusing our lives. It's part of everything we do, from our devices to the web browsers we use; everything we engage with has some form of AI intelligence behind it.

With this new phenomenon, we're seeing a lot of rapid innovation in AI. Just this year, unless you've been living under a rock, there have been a ton of announcements and a lot of cutting-edge breakthroughs in AI. You've heard of OpenAI, which is the buzz right now.

Everybody is talking about it. You hear about ChatGPT and so much more. All of this is exciting, alongside the other important innovations in AI we're currently seeing. Another thing is that companies are making a lot of investments in adopting AI into their products and business processes, in order to be more efficient with the work they do and to have a competitive advantage.

Another area is social expectations. With all of these innovations impacting individuals and societies, society's expectations are also evolving as AI technology gets more and more advanced, mature, and competitive.

Another area we're seeing is that governments are starting to regulate AI. There are certain industries, like finance or healthcare, where you have certain guidelines when dealing with AI, but people still feel that's not enough; Congress or heads of state need to step in and regulate AI a little more. And similar to what I mentioned about societal expectations growing: in the news, whether on a weekly or a daily basis, we constantly hear about how AI is used or misused.

So there's a lot of public scrutiny around AI. AI is doing a lot of amazing, great things, but there are areas that are questionable, that people are scrutinizing, and that carry a negative connotation or some misinterpretation of what AI is going to do to our society. Before we get into the details, one thing I want to establish first: we hear "ethical AI", we hear "responsible AI", and the terms are used interchangeably. But what exactly is it? I'm going to use Microsoft's AI principles. These are the core principles of responsible AI that the company came up with for our internal use in every AI product and service that we build. Our engineering and research teams have to adhere to these different areas. For instance, making sure the services are fair and reliable. Reliability is about looking at the worst edge case you can potentially think of. Think of a scenario like smart cars.

What's the worst possible test you can run on smart cars to make sure they're safe? Are you testing them late at night? Are you testing them when there are kids around, that sort of thing? How reliable is the AI solution, and how safe is it? Then of course privacy and security: what kind of data are we using, and are we respecting people's privacy with AI? Then inclusiveness, which is what I was leading up to. Inclusiveness can mean a lot of things, like making opportunities available to people with similar characteristics. Another good example: there are a billion people with disabilities out there. When we're implementing solutions, are we taking into account the different disabilities people have? These can be things like dyslexia or color blindness.

The list goes on. Then transparency: how are we showing transparency? I mentioned regulations being out there, so if you're being audited, how do you show transparency, explanations, and an understanding of why a model came up with a certain outcome? Lastly, for people who develop AI systems and solutions, how are we holding ourselves accountable? When we're talking about responsible AI, those are the core areas we're referring to. The next question is: why do we need responsible AI? We're trying to do our best to make sure AI is not harmful, but why do we need it? One thing that we as a society have observed is that AI is continuously missing expectations; that's number one. Number two, all these breakthroughs coming out are exposing more challenges that we need to take into account. A prime example: with AI innovation, one new thing that's a very big threat is deepfakes. AI can generate a photograph or video of somebody doing something even though it's false, but to the human eye it looks true, or generate a voice that sounds like you saying things you never said.

Those are new things: as we innovate, we have to take into account the new risks and new challenges coming out. Then with government regulation, there's a call for more and more government involvement, and we're seeing how governments are starting to approach the big tech companies and hold them accountable for AI. Now, I'm sure all of you are wondering: OK, we agree responsible AI is good, but what does this have to do with me? Number one, when we're building our machine learning models, yes, we have the traditional metrics for checking how many accurate predictions a model made and how well we're able to bring down the errors and inaccuracies. But that's a calculation; it really doesn't prove whether your model is fair or not. So number one, we don't have enough tooling. Number two, a lot of decisions are made from AI solutions. How do we enable decision makers who rely on AI to make faster, more confident decisions based on the AI solutions and products they have, when at the end of the day end users are skeptical?

How can we gain their trust by showing them more transparency, and by showing them that these models have their best interests in mind? One of the things I wanted to share is how we think about debugging a model after we train it. Some of the popular ways of debugging a model's performance are looking at things like accuracy and recall if you're dealing with a classification model; if you're dealing with regression, we look at things like mean squared error, mean absolute error, and root mean squared error. All of those error metrics are good, but on their own they're not sufficient when it comes to responsible AI. Those are the gaps: the metrics are heavily mathematical, but they don't show the human aspect of how the model is impacting people or society.
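
For reference, the traditional metrics listed here are one-liners in scikit-learn; a small sketch with toy arrays purely for illustration:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score,
                             mean_absolute_error, mean_squared_error)

# Classification metrics on toy labels/predictions.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])
print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
print(recall_score(y_true, y_pred))    # fraction of actual positives found

# Regression metrics on toy values.
y_true_r = np.array([3.0, 5.0, 2.5])
y_pred_r = np.array([2.5, 5.0, 4.0])
print(mean_absolute_error(y_true_r, y_pred_r))
mse = mean_squared_error(y_true_r, y_pred_r)
print(mse, np.sqrt(mse))               # MSE and RMSE
```

None of these numbers says anything about which groups of people the errors fall on, which is exactly the gap described next.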

The next challenge is around errors. Similar to what I just said about accuracy: we'll say, hey, this model is 89% accurate, it looks great, let's go home. But realistically, within that model there could be certain pockets, demographics of data, where the model is actually not performing well. Let's say this bracket represents a single mom, and we have a model that predicts loan approval. Maybe the model is not giving loans to single moms even though they may have a great job and an awesome credit score compared to somebody in a two-income household; the model may not be favorable to this demographic. The question is: even though we see a rosy outcome overall, what tools are out there to let us go under the hood and expose areas where this model is not performing well? Another challenge we face is that AI models are black boxes. Half of the time we do not know which key features are driving the prediction; that's number one. Then there's being able to explain, especially when a model makes a mistake.

How do you figure out why it made a mistake? Also, for transparency, when you're being audited you need to show how you went about making a decision, like giving out a loan or denying somebody a loan. You need to show evidence and reasons why, or why you diagnosed a patient with a certain disorder, that sort of thing.

These are the types of challenges we face out there. The good thing is, since this is a huge gap and a huge problem that data scientists, organizations, and AI developers are facing, and they all understand the importance of responsible AI, a lot of projects, for example Fairlearn and InterpretML, have created open-source solutions to help data scientists expose, analyze, assess, and identify some of the issues within their models that could be violating responsible AI principles.

One thing to note is that all of these come from organizations and researchers around the world, including Microsoft researchers. Microsoft has also been instrumental in coming up with some of the open-source tooling; for example, we have EconML, that sort of thing. All of these are examples of mature open-source tooling that's out there. The problem is, as a data scientist or an AI developer, you have your notebook, you trained your model, everything is in one place; but if you're going to use these libraries, you have to use them separately. Maybe there's one notebook for Fairlearn and a different one for InterpretML. Even though you're getting the job done, it's a little tedious. So one of the things Microsoft decided to do is package the mature open-source tooling that's out there and make it accessible. One of the tools that came out of this is the Responsible AI Dashboard toolkit, which anybody can utilize; it's a one-stop shop, a holistic, interactive tool that developers and data scientists can use to debug and identify issues with their models. That's number one, and the same technology is also incorporated as a feature within Azure Machine Learning.
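
To make the "separate notebooks" point concrete, a standalone fairness check with Fairlearn alone looks something like this sketch; the toy arrays and the `gender` sensitive feature are illustrative assumptions, not data from the talk:

```python
# Disaggregated accuracy with Fairlearn's MetricFrame (illustrative data).
import numpy as np
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
gender = np.array(["F", "F", "F", "M", "M", "M"])

mf = MetricFrame(metrics=accuracy_score,
                 y_true=y_true, y_pred=y_pred,
                 sensitive_features=gender)
print(mf.overall)   # aggregate accuracy
print(mf.by_group)  # accuracy broken out per group
```

The dashboard bundles this kind of analysis with the others so you don't have to stitch the libraries together yourself.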

Let me check the time. OK, looks like we have about 15 minutes. So that's what Microsoft is currently doing in terms of pioneering and enabling engineers to incorporate this into their machine learning life cycle. Similar to how we debug software applications, for the first time we have a great way to debug machine learning models. At a high level, one of the things in the Responsible AI Dashboard is the ability to do error analysis. Remember how I mentioned you may have 89% accuracy, but there are different pockets distributed in your model, areas where your model is failing? That's what error analysis does. You can also do data exploration to see how your data is distributed, whether groups are overrepresented or underrepresented. You can get a holistic view of your model and see the discrepancies in some of the traditional metrics you're used to using when doing performance analysis. Fairness assessment is also there, as well as model interpretation.

Interpretation means being able to explain your AI solution. Then counterfactuals: that's for when you have a model and you want to see how you can get a desired outcome. It made a prediction, but the desired outcome you want is something else, or you manipulate the features just to see how the model will react. A prime example: you deny somebody a loan and they ask, OK, what can I do to get the loan? This is exactly where you can show them options, like: if you increase your credit score by, let's say, 100 or 50 points, you'll be able to get this loan, or if you increase your salary by 28,000, this model will approve you. It brings another layer of transparency, and it also helps decision makers make decisions, so it falls under decision making for business decision makers. Then we also have causal analysis. If you want to strategize about the actions you want to take but don't have enough resources to do A/B testing, this is a great way to analyze how you can change one feature and the impact it will have on the real-world outcome.

Now I'd like to show you a practical example of how this works. Give me a few minutes to bring up the notebook. Can you guys see the new screen? Somebody give me a yes or no. OK, awesome. Oh, wow, there are a lot of people in the chat; I didn't realize that. So, what this notebook is doing, the use case, is we have hospital data. The model is being trained as a classification model: it's going to predict whether a diabetic patient is going to be readmitted back to a hospital within 30 days or not. Similar to your own programs, it will look the same: you bring in the data and look at its different characteristics. In our case, we have things like age, gender, and race; how they came in (did they come through the emergency room?); where they were discharged to (did they go home, or to assisted living, that sort of thing); the type of medications they're on; their insulin level; the A1C results. It's a very detailed picture of their health history and personal information.

Next is the typical code that you use to train; you train however you want, for whatever business use case you're working with. After training the model, you do have a model now, so how does this apply to the dashboard? How do you hand it over to the dashboard for debugging? One thing to pause on: the example I'm showing you right now is the open-source version. This is the Responsible AI Dashboard toolbox, which you can easily use: just grab the library, plug it in, and you'll be able to see the dashboard with your own trained model. I know a lot of you are going to be excited after this demo to try it on your own model and see what happens. The very first thing you need is the responsible AI widgets; that's what actually creates the visuals and the interactivity of the Responsible AI Dashboard, which you'll see in a few minutes. The next thing is the responsible AI insights; those are what actually grab the information after it evaluates your model. It's going to extract all the information, like all the errors it found and the key features driving your model's decisions, that sort of thing. That's what we mean by the insights.

When you set it up, the first thing is to specify which columns are numeric or categorical; that's just so it knows how to maneuver around the data. Then you pass the model you just trained, your training data set, and your test data set. This is a classification use case, so you specify the target feature and whether it's a classification or regression use case; you pass all of that along to instantiate the class. The next thing you add is what you want to run analysis on. For me, I'm saying: I want the explainer (that's the one that does interpretability), I want error analysis, I want counterfactuals, and I want causal analysis. You specify the different components you want. When you run compute, that's when it actually goes through your data and evaluates your model. This next part is optional: if you want to create certain cohorts to look at, you can create them manually in code, as sketched below, or you can easily create them interactively on the dashboard; we'll get into how cohorts work.
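
For the manually created cohorts, the open-source widgets expose a cohort API roughly like the sketch below; the column name `prior_inpatient` and the threshold mirror the demo's "prior hospitalization greater than one" cohort but are assumptions, not the exact code from the session:

```python
# Sketch: pre-defining a cohort to pass into the dashboard.
from raiwidgets import ResponsibleAIDashboard
from raiwidgets.cohort import Cohort, CohortFilter, CohortFilterMethods

prior_hosp = Cohort(name="Prior hospitalization > 1")
prior_hosp.add_cohort_filter(CohortFilter(
    method=CohortFilterMethods.METHOD_GREATER,
    arg=[1],                   # keep rows where the value is > 1
    column="prior_inpatient",  # assumed column name
))

ResponsibleAIDashboard(rai_insights, cohort_list=[prior_hosp])
```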

Once you run that, it generates a very beautiful dashboard for you. Remember, our use case is dealing with diabetic patients. One thing I love about error analysis is that its key mission is for you to identify where the errors are, where the error rate is high. The dashboard gives you very good visual aids to locate where the big problems are: the darker the red, the higher the error rate; the lighter the red, the lower the error rate; and if a node is gray, that means there's a very low error rate. As a tip, when you're starting out, instead of going on a scavenger hunt to figure out which area of your model isn't working, the very first thing to assess is the root node. The root node tells you that you have 994 total test records, and out of that, 180 incorrect predictions. From there it breaks things down across the different features. If you hover over a node, you can see what was correct and what was incorrect for it. If you double-click on a node, you'll be able to see the actual feature conditions.

For debugging, what we're trying to do is identify the paths that have the highest error rate. For some reason my computer is not cooperating, but basically, if you find the node with the highest error rate, you just double-click on it and it will highlight the path, and the good thing is you can click on save. Hmm, for some reason my screen froze, so let's continue and I'll get back to this. Basically, when you highlight the path, you can click on "save cohort"; that's the path with the highest error rate. You can do the same for the path with the lowest error rate, and the good thing is that you don't only want to see where the model performs worst; you may also want to compare why the model performs so well within another cohort. When you click on save, it shows all the different features you selected and lets you save the cohort. My dashboard is acting up, so give me a second; let's see what's going on. Yeah, it refreshed. OK. Just remember, the main purpose of this is to identify areas with high error rates. You save the path and give it a unique name. It also shows you something like a cheat sheet.

The cheat sheet lists the top features, from top to bottom, that your errors are coming from. All of this is to identify your errors. Another good thing is you can select a certain feature. Let's say you spotted a feature and you want to see how much it's impacting the error: you can highlight it. I selected age; I want to take a look at age and figure out what the error rate is. Even though the error rate for this age cohort is on the small side, only 20%, the thing that stands out is the error coverage, which is 73.33%. Remember the root node: of the 192 errors in total that the dashboard found from evaluating the model, that means roughly three quarters are coming from patients that are over 90; this cohort of patients is contributing that share of the error.
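
Error rate and error coverage are just two different ratios over the same cohort; a tiny illustration using the demo's rough numbers (the cohort counts are hypothetical back-calculations, not values from the session):

```python
# Error rate vs. error coverage for a cohort (hypothetical counts).
total_errors = 192   # all incorrect predictions the dashboard found
cohort_errors = 141  # errors landing inside the "age > 90" cohort
cohort_size = 705    # rows inside that cohort

error_rate = cohort_errors / cohort_size       # ~20%: cohort looks fine...
error_coverage = cohort_errors / total_errors  # ~73%: ...yet it holds most errors
print(f"rate {error_rate:.0%}, coverage {error_coverage:.0%}")
```

A cohort can have a modest error rate and still account for the bulk of the model's mistakes, which is why coverage is worth checking.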

That's another reason why you may want to select that and save it as a cohort, to investigate what's going on there. So that's an example of how to utilize it. The good thing is that in the code we did create different cohorts: we did one for age group, and one to compare patients that had a prior hospitalization. If they had a prior hospitalization more than once, that's one cohort we collected; if they did not, that's another cohort. From there, the dashboard automatically creates a cohort called "all data", which has all the data, plus the cohort for patients that did not have a prior hospitalization. From there you can see the sample size, how much data is in each cohort, and also the performance: how are they doing? We can see that the cohort with prior hospitalizations greater than one is problematic; the model is not performing very well there. You can also see things like the false positive rate, which overall is very low.

That means the model is incorrectly predicting that patients are going to be readmitted back to the hospital when they are not, but doing so at a very small rate, so that's not very problematic. But when we look at the false negative rate, it's very high. That means that in general it is falsely predicting that patients are not going to be readmitted even though realistically they are going to be readmitted. For the probability distribution, you can also look, for a cohort you created, at the probability of those patients being readmitted, and you can play around: if you want to see, for the ones that are readmitted, what the probability is for these cohorts, you can do the same thing. Looking at the accuracy scores, everything looks amazing, but when you go to, let's say, precision, you start seeing that it's not looking good. These are the reasons you need a holistic understanding, comparing the different cohorts to see where the model is performing well and where it's not. You also have the option of looking at the confusion matrix.
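
Both of those rates fall straight out of a confusion matrix; a short sketch with toy labels (1 = readmitted within 30 days):

```python
# False positive rate vs. false negative rate from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # toy ground truth
y_pred = np.array([0, 1, 0, 0, 0, 1, 0, 0])  # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)  # predicted readmitted, actually not
fnr = fn / (fn + tp)  # predicted not readmitted, actually readmitted
print(f"FPR={fpr:.2f}, FNR={fnr:.2f}")  # here: FPR=0.25, FNR=0.75
```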

Another good feature it provides is the model overview, where you can do feature-based analysis. For prior inpatient visits, if you want to isolate some of those fields, you can select them, and the good thing is the dashboard will automatically partition your data into reasonable cohorts and show you, for each cohort, the sample size (how many patients fall within that condition) and the accuracy score, so you can see where the model is really struggling. That's the kind of information you can get. Another area is looking at your data, because data is a blind spot that a lot of people don't pay much attention to. In our case, let's compare the true count of how many patients were readmitted versus not readmitted. You can see a data imbalance: there are fewer patients being readmitted, which in real life is actually a good scenario, but when it comes to prediction you can see the model is not learning well. It predicts the not-readmitted class more than the readmitted class because it doesn't have enough data; there's an imbalance. Another thing you can do is start looking at the sensitive features.

Look at the distribution of features like race: you can see there's a high proportion of Caucasians, and the next largest group is African Americans. This is a warning sign that your predictions are potentially going to be skewed if you have this imbalance of data, because if you take your model to a demographic that is heavily African American or Hispanic, your model is going to make incorrect predictions. Another thing you can do is look at gender: how is gender doing? The split between male and female is about even, so it doesn't look too problematic. Another area is age: how does the age distribution look? Sure enough, there's more concentration of older patients, older than 60. This may be a normal thing that health practitioners see, so it's very good for machine learning professionals to work closely with the domain specialists to understand what is realistic and how to better represent the data, to make sure we don't have a model that is biased, or racially biased in the case of race, which we looked at.
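
These distribution checks are easy to reproduce outside the dashboard too; a quick pandas sketch, where the file path and column names are assumptions based on the demo data set:

```python
# Quick distribution checks for class and demographic imbalance.
import pandas as pd

df = pd.read_csv("diabetes_hospital.csv")  # hypothetical path

print(df["readmitted"].value_counts(normalize=True))  # class imbalance
print(df["race"].value_counts(normalize=True))        # representation by race
print(df["gender"].value_counts(normalize=True))      # roughly even split
print(df["age"].value_counts(normalize=True))         # skew toward older patients
```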

These are the different things you can do and play around with to figure out issues within your model, where there could potentially be unfairness, bias, or other potential harms. In this view, you can easily see whether fairness is the issue, reliability is the issue, or inclusiveness is the issue; you start seeing things like that. The next section is explainability. You can look at it on a global level: the dashboard shows the features, globally, that are driving your model's predictions. We can see that prior inpatient visits, whether they were hospitalized in the past, play a hugely significant role when the model makes a prediction. Age also plays a major role in determining whether a diabetic patient is going to be readmitted back to the hospital within 30 days or not.

The number of medications they're on also makes a difference, along with the lab procedures that were done, their race, and the time they spent in the hospital in the past. Those are good areas to look at. Another thing, especially when something goes wrong or you're trying to understand how the data behaves, is that you can also look at individual data points. You can pick out the ones that have incorrect predictions in particular, highlight any one of them, and it will show you the key features driving the model's prediction, that sort of thing. Then lastly, you get to counterfactuals. Let's say for this particular patient the prediction was that they're not going to be readmitted. We click on that and try to see: how could we get them readmitted? In real life we don't want that, but since this is classification, let's say we want to get a different outcome for them. The good thing is the dashboard will recommend features in our data set that we can change in order to get our desired outcome, or to see how the model is going to react.

Once you select the data point, you can click on "create what-if counterfactual". This is our original data, and it seems like I need to refresh the browser, but all of these should be one. Basically, what it's doing is giving you recommendations: OK, if you take your original data record and reduce the number of lab procedures, that would make this patient be predicted as readmitted.

These are examples of suggestions, or combinations of suggestions, that it gives for what somebody would need to change in order to get a different outcome. It also shows the delta when you make a change: if I select a suggestion, it will show me the delta, the difference that making those changes produces. So these are the different tools currently available for you to evaluate and debug your model and find the issues that could be in it. To get started, I highly recommend visiting these links. The very first one is open source: you can go to GitHub, look at some of the example notebooks, and you'll be able to do the same thing I just did, apply it to your own model, and start debugging. There are a lot of features I didn't cover, but from there you can play around and start understanding it. There's also a demo workshop you can go through; that one uses Azure, so you can see the benefits of using this in Azure. It's the same dashboard when you run your end-to-end machine learning life cycle.

And this is another useful link for responsible AI details and other information you need to use the Responsible AI Dashboard. It looks like we're at time, so I'll go ahead and end the session. OK, any questions? OK, well, thanks for having me, everybody.