J. Michelle Maraj - Common Data Visualization Mistakes


Video Transcription

I mean, I guess I'll go ahead and get started. I will probably have some time for questions at the end, so if you want to throw your thoughts in the chat or the Q&A throughout the presentation, I'm happy to revisit that when we come back to it. But hello, thank you all so much for joining my presentation on common data visualization mistakes. Over the past, I don't know, 30 or 40 years, data visualizations have become easier to create, from charts to even dashboards. A little bit ago, you might have needed a statistics degree or an art degree to figure out how to, you know, physically draw something on paper that is an accurate representation of the data. But then we started getting computer tools like Excel or, you know, things like Google Sheets, where you can generate accurate visuals based on your data. And now in the present day, we even have drag-and-drop tools like Tableau or Power BI that help you explore your data and experiment with different types of views. These charts are getting easier to generate, you know, with just one click of a button.

We've started to get into a habit of clicking that generate button without really thinking about whether our visuals make any sense. So with the ease of all the tools out there, we tend to overlook different factors. And what I wanted to do today was go over some common data visualization mistakes, so hopefully when you, you know, go back to your job or go back to school and make those next charts, you can make better visualization decisions. But first, I'll really quickly introduce myself.

My name is Michelle Maraj and I'm currently a financial systems analyst at Lyft, based out of the Bay Area in California, in the USA. I am currently the lead Tableau developer on our financial planning and analysis team. And what that means is, I'm responsible for creating and maintaining the Tableau dashboards that our FP&A and finance teams use. I've been in this role for just over a year now. Previously I was in consulting, which was also a really great experience, but it has been so exciting for me to start, you know, doing data visualization and Tableau full time. So I've been really happy with where I'm at. What are we gonna go over in the short time that we have together? We're gonna talk about chart design and picking out, you know, what chart type you're gonna use. We're gonna talk about inaccurate conclusions, aggregating things, color, handling uncertainty, and data quality.

And at the very end of it, I promise I will recap to give y'all one final reminder about what we covered today. But we've got a lot of great stuff and, like I mentioned, I hope that I can give you at least one or two tips to take away. That way, as you are putting together your next visualization, you have some other things to think about and you can create the best visualization for your audience. But as we just saw on the agenda, we are gonna first talk about chart design. And so this is dealing with, you know, whether you want to do a pie chart or a line chart. Sometimes when we are generating our visuals, it's really easy to just, you know, click a random one, whichever one looks prettiest, and go with it. But we do want to put a little bit more effort into thinking about why we would choose between different chart types. So this is the first example that I've got for y'all, and I am hoping to see a little bit of activity in the chat for this one. This is a pie chart that I put together where I essentially collected some data about the different meals I ate over the course of a month and then put them in the pie chart. And I wanna see, what do you all think? Which food did I eat the most of last month?

It looks like I ate sandwiches, pasta, rice and ramen, and put them into this visualization. But do y'all have any guesses as to which type of food I ate the most of? Pasta and ramen, I know. For me, well, pasta is like spaghetti and then ramen is like a soup, so it's a little bit different. I'm very particular about my noodles. Yeah, it is really challenging to tell the difference between pasta and rice. And I think y'all got the point of my slide. So essentially a lot of data visualization experts will argue that if you have two or more categories that you're looking at, putting them into a pie chart can make it pretty challenging to read, because our eyes have a more difficult time comparing angles that sit at different orientations, as opposed to just comparing the height of a bar.
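As a rough illustration of the pie-versus-bar point, the sketch below uses made-up meal counts (the talk does not give the real numbers) and prints both the pie-slice angle and a simple text bar for each category. The names `meals`, `pie_angles`, and `text_bars` are hypothetical helpers for this example only.

```python
# Hypothetical meal counts; the talk does not give the actual numbers.
meals = {"Sandwiches": 22, "Pasta": 24, "Rice": 26, "Ramen": 18}


def pie_angles(counts):
    """Return the angle in degrees each category would take in a pie chart."""
    total = sum(counts.values())
    return {name: 360 * n / total for name, n in counts.items()}


def text_bars(counts):
    """Return one text bar per category, sorted from largest to smallest."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [f"{name:<10} {'#' * n} {n}" for name, n in ordered]


angles = pie_angles(meals)
for name, angle in angles.items():
    print(f"{name:<10} {angle:5.1f} degrees")
print()
print("\n".join(text_bars(meals)))
```

With these assumed counts, the three largest slices differ by only a few degrees each, which is hard to judge by eye, while the sorted bars differ visibly in length.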

So when I take this exact same data and put it into my bar chart, you can actually see really quickly that I did eat more rice meals last month. And I mean, you can argue that on this pie chart, on this side here, we did put the percentages, which do help the audience distinguish much quicker, you know, which food I ate the most of. But if you do want to get your story across to your audience as quickly and effectively as you can, with as little confusion as possible, in this type of instance it might help to use a bar chart. And so as mentioned, most of the time a bar chart is gonna be easier to read than a pie chart, even though pie charts look snazzy. Now we're talking about bars versus lines and when you need to choose between them. With a bar chart, it's really important that you start at zero all the time, because the purpose of a bar chart is to measure that length and see how big those bars are compared to one another. Now, I have the exact same data in a bar chart and a line chart. But you'll notice that in my line chart, I didn't start at zero.

With a line chart, you don't necessarily need to start at zero, because you are more interested in the slope of the line, or that change over time. So the bar chart is gonna be helpful for looking at, like, total values compared to one another, whereas your line chart is going to be really focused on that slope and that change. What's really important to keep in mind with a line chart is that your x-axis does need to be sequential. So you can see it goes 2018 through 2021. There is a correct order. You can't switch two of the years, because that would, you know, change the slopes of the lines and be misleading. If you have categories that do not have an order to them, then you should be using a bar chart instead, because with the bar chart, the order of those categories doesn't necessarily matter. And with a bar chart, typically what you'll see is people sort them from greatest to smallest or smallest to greatest. So it's really important that you use bar charts when you have those categorical values, and line charts should only be used if you do have an order to the values that you're putting across that axis.
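To put a number on why bars should start at zero, here is a small sketch with invented yearly totals. It compares the true ratio of two values with the ratio of the bar lengths that would actually be drawn when the axis starts somewhere above zero. `drawn_ratio` is a hypothetical helper, not part of any charting library.

```python
def drawn_ratio(a, b, axis_start=0.0):
    """Ratio of drawn bar lengths for values a and b when the axis starts at axis_start."""
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (a - axis_start) / (b - axis_start)


# Hypothetical yearly totals, not taken from the talk's slides.
sales_2020, sales_2021 = 100.0, 110.0

honest = drawn_ratio(sales_2021, sales_2020)           # axis starts at 0
truncated = drawn_ratio(sales_2021, sales_2020, 90.0)  # axis starts at 90

print(f"True ratio of values:    {sales_2021 / sales_2020:.2f}")
print(f"Bar ratio, axis from 0:  {honest:.2f}")
print(f"Bar ratio, axis from 90: {truncated:.2f}")
```

Under these assumptions a 10% difference is drawn as one bar twice the length of the other once the axis starts at 90, which is the distortion the zero-baseline rule is meant to avoid. A line chart encodes change through slope rather than length, so the same truncation is less of a problem there.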

And then the last thing I wanted to touch on here is 3D charts. So 3D charts are super cool looking. It's a lot of fun. I know that they're easy to generate in Excel and different tools like that. But the tricky thing with 3D bar charts is they can be a little bit challenging for your audience to read. So this is a bar chart looking at the cost of a widget across multiple days. And if I asked you, what is the cost of the widget on Thursday, it might take you a second, because are you supposed to reference the back of the bar, so that's 75, or are you supposed to look at the front of the bar, which might be like 70? With a 3D bar chart, there's that possibility of error and of misunderstanding, you know, what the exact value on the chart is. So data visualization experts would recommend that you actually stay away from 3D bar charts. Most, I don't know, all of the time that I can think of, a 2D bar chart is going to be cleaner and easier for your audience to read. So as far as charts go, what we want to keep in mind is there are pros and cons to the chart type that you choose, and it really depends on the data that you are displaying.

So we wanna make sure that you are making those smart decisions when you are picking out which chart type you want to go with. Next, we are gonna talk about inaccurate conclusions. So with a data visualization, typically you want to put some data together so that your audience can draw some type of conclusion from it, whether that's, you know, exploring the data or learning a little bit more about it. But as you are putting together these visuals, you wanna make sure that you aren't creating misleading conclusions. And what do I mean by that? This here is a scatter plot of ice cream sales versus sunglass sales. So you've got the ice cream sales on the left, on the y-axis, and then you've got sunglasses sold across the bottom. And what we can see is, as the ice cream sales increase, so do the sunglass sales. You can see that there's a pretty positive correlation there. So my question to y'all is, are we selling more sunglasses because ice cream sales are going up, or are we selling more ice cream because sunglass sales are going up? And I mean, they're pretty closely correlated and it looks like there's some type of relationship there, right? Well, not so fast. Oh yeah, Lisa's got it.

Correlation is not causation. Just because you can plot two variables on a chart does not necessarily mean that one factor drives the other. In the example that we've got here, maybe there is some type of external factor that's leading both of these to, you know, increase or decrease at the same time, such as the weather outside. Maybe because it's sunny, we sell more ice cream and more sunglasses. So again, as you are putting together these visualizations, you wanna be careful as to what types of conclusions your audience could draw from them, making sure that you either explain what's going on with the chart or consider what other factors might be out there.
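The weather explanation can be sketched with simulated data. In the made-up model below, temperature drives both sales figures and the two never influence each other, yet they still come out strongly correlated. All numbers, the seed, and the helper names `pearson` and `residuals` are assumptions for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(ys, xs):
    """Residuals of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
# Made-up model: temperature drives both sales figures, which never affect each other.
temperature = [rng.uniform(10, 35) for _ in range(200)]
ice_cream = [2.0 * t + rng.gauss(0, 3) for t in temperature]
sunglasses = [1.5 * t + rng.gauss(0, 3) for t in temperature]

raw = pearson(ice_cream, sunglasses)
controlled = pearson(
    residuals(ice_cream, temperature), residuals(sunglasses, temperature)
)
print(f"Correlation of the two sales series:    {raw:.2f}")
print(f"Correlation after removing temperature: {controlled:.2f}")
```

Once the part of each series explained by temperature is subtracted out, what remains is just independent noise, so the second correlation should sit near zero. Real data rarely separates this cleanly, so treat this as a way to think about the problem rather than a test for causation.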

This is another visualization I saw online that I always like to share when talking about this topic. It is the number of people who drowned by falling into a pool, plotted against the number of films that Nicolas Cage has appeared in. And so, I mean, from a statistics perspective, these are pretty closely correlated, right? It's pretty rare that you see a visualization where two sets of data points follow each other this closely. And so the question is, is Nicolas Cage making movies because people are drowning, or are people drowning because of Nicolas Cage movies? And the thing here is, again, just because you can plot these two sets of data points on a chart does not necessarily mean they're even related at all.

Maybe, I mean, maybe one drives the other; you'd probably need to do a little bit more research into it. But something to consider is, I mean, it could be chance. And my third example here is from World War II, actually. During World War II, the US military, or Air Force, wanted to figure out how to reinforce the planes that they were sending out to fight. So, you know, you cannot reinforce the entire plane, because it's expensive and it's gonna be really heavy. And so if you could only reinforce, you know, some areas of an airplane, where would you want to do it? The way that the United States decided to figure out where to reinforce their planes was, they looked at all of the planes that returned from battles and plotted on this visual here where the majority of the bullet holes were on a plane. Now, I think when you look at a visual like this, the first thing that comes to mind is, hey, we should reinforce where the planes are getting shot. So maybe reinforcing, you know, the wings and the tail. But the issue in this instance is we are only looking at the planes that survived, that ended up coming back home. And what we're discounting is the missing data, the data from those planes that did not actually make it back.
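The returning-planes problem can be mimicked with a tiny simulation. This is only a sketch under invented assumptions: each plane takes exactly one hit, hits are spread evenly over three sections, and the loss chances per section are made up. None of the numbers come from the historical study.

```python
import random

# Hypothetical chance that a single hit in each section brings the plane down.
LOSS_CHANCE = {"nose": 0.8, "fuselage": 0.3, "wings": 0.1}
SECTIONS = list(LOSS_CHANCE)


def simulate(n_planes, rng):
    """Return (hits on all planes, hits on returning planes), counted per section."""
    all_hits = dict.fromkeys(SECTIONS, 0)
    survivor_hits = dict.fromkeys(SECTIONS, 0)
    for _ in range(n_planes):
        section = rng.choice(SECTIONS)  # one hit per plane, evenly spread
        all_hits[section] += 1
        if rng.random() >= LOSS_CHANCE[section]:  # this plane makes it home
            survivor_hits[section] += 1
    return all_hits, survivor_hits


all_hits, survivor_hits = simulate(10_000, random.Random(0))
for section in SECTIONS:
    print(
        f"{section:<9} all planes: {all_hits[section]:5d}"
        f"   returned: {survivor_hits[section]:5d}"
    )
```

In this toy setup every section is hit about equally often, but the section that is most dangerous to be hit in shows the fewest holes among the planes that return. A chart built only from the returned planes would therefore point at exactly the wrong place.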

So, I mean, you can kind of infer from the visual that the planes that didn't make it back might be the planes that got shot on the nose and on that area of the tail. We now know that this is called survivorship bias, because your data is biased by which data actually comes back to you. And so maybe, in reality, we should be reinforcing the nose and the tail so that those planes can come back. This is just a really interesting example that, again, just because you can plot it on a chart, you've gotta take a step back and think about what conclusions you're drawing from it. I think originally our eyes go to those bullet holes, but we really need to consider the full scope of the problem, so that we can draw the right conclusions from the visual. So the tip here: correlation does not equal causation. Next, aggregation issues. Aggregations are ways of summarizing our data, whether that's adding it all together to get a sum or taking some type of average. But there are a couple of factors that we want to consider as we are aggregating our data.

First, I want to talk about mean versus median. So let's say I took five Lyft rides, with totals being $12, $16, $16, $18, and $83. If my friend comes up to me and says, hey Michelle, about how much do your rides cost, I could give my friend the mean value, which would mean me averaging those values together, which comes to $29. Or I could give my friend the median value, which is that middle data point, $16.
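The arithmetic can be checked in a couple of lines with Python's standard `statistics` module, reading the five totals as $12, $16, $16, $18 and $83 (the reading that reproduces the $29 mean quoted in the talk).

```python
from statistics import mean, median

# The five ride totals from the talk, in dollars.
rides = [12, 16, 16, 18, 83]

print(f"Mean:   ${mean(rides):.2f}")    # pulled upward by the single $83 ride
print(f"Median: ${median(rides):.2f}")  # the middle value once the list is sorted
```

The sum is 145, so the mean is 145 / 5 = 29, while the median is the third of the five sorted values, 16. Dropping the $83 ride would move the mean to 15.50 but leave the median at 16, which is why the median is often described as less sensitive to outliers.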

Now, the thing is, neither of these numbers is necessarily wrong. Well, I did the math a couple of weeks ago, but I'm assuming the math is right. The math is right. So deciding whether I want to give my friend that mean or the median changes the story a little bit. My mean might be a better representation of, you know, what my overall trips cost, including my $83 outlier to the airport, whereas my median value might be a better representation of what my daily costs look like, or what my work commute costs look like. So again, neither of these values is incorrect, but the value that I pick can tell a slightly different story. It might make my rides look a little bit more expensive or a little bit cheaper. So this is just something that you need to consider as you are reading some of those metrics on different dashboards, or as you're deciding how to summarize what you are presenting to your audience. Next is a really interesting case study that I came across, or read about, in a book. And so here, this is the blog Graphic Zoo on Tumblr.

And what they did was they wanted to figure out what state is the most dangerous to drive in. Now, I know that we've got a lot of US audience members today. What they did was they plotted the percent change in traffic fatalities between 1975 and 2012. So we can see at the top of this visualization, DC saw the biggest decrease in traffic fatalities. So that means, like, the number of accidents is going down in DC. Now, what we notice is, near the bottom, states like Nevada, Florida, and Arizona saw an increase in traffic fatalities. Now, I mean, does that just mean that those states are more dangerous to drive in because they're having more accidents? Alberto Cairo, who is one of my favorite data visualization authors, because he's from Florida, wanted to investigate this a little bit more. And what Cairo wanted to do was maybe try aggregating the data a little bit differently to see if we would get the same results. So for the visual on the far left, what Cairo did was he took the number of traffic fatalities in these two time periods and divided it by the number of vehicles on the road, giving fatalities per 100,000 vehicles.

And what he found is that all of the states on the list completely shifted order. It looks like Wyoming was now found to be the safest, and Florida, which we saw was previously at the very bottom, did move up a little bit in the rankings. And what we're seeing is that no states saw an increase in accidents when you normalize by the number of vehicles on the road. Then he wanted to try another example. So what he did was he looked at the number of accidents per million annual vehicle miles, so the number of miles actually driven on the road, and again, the rankings completely changed. Florida increased quite a bit, all the states moved around, and we see North Dakota actually seeing an increase in traffic accidents now. And so the overall question here is, how should we be measuring danger? Should we be looking at the number of accidents per state, the number of accidents divided by the number of cars on the road, or divided by the number of miles that we drove? And again, there's no wrong answer. It really just depends on your audience, it depends on who you're presenting to and what data is available. And really what we need to keep in mind here is, if we aggregate the data differently, it tells a completely different story.

So as you are looking at, you know, reading some of your own visualizations or putting together your own charts, you do want to consider if the data was aggregated any differently, would your story change? So as I mentioned, consider how these aggregations can impact these visuals overall.
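To make the normalization point concrete, here is a sketch with two invented states. The figures are purely hypothetical and are not the data from the case study; they are chosen so that the raw count and the normalized rates point in opposite directions.

```python
# Hypothetical figures for two made-up states; none of this is the data from the talk.
# Each entry: (fatalities, registered vehicles, vehicle miles driven in millions)
states = {
    "State A": {"1975": (1000, 1_000_000, 10_000), "2012": (1200, 3_000_000, 40_000)},
    "State B": {"1975": (1000, 1_000_000, 10_000), "2012": (800, 1_100_000, 9_000)},
}


def pct_change(old, new):
    """Percent change from old to new."""
    return 100 * (new - old) / old


def changes(history):
    """Percent change from 1975 to 2012 under three different aggregations."""
    f0, v0, m0 = history["1975"]
    f1, v1, m1 = history["2012"]
    return {
        "raw fatalities": pct_change(f0, f1),
        "per 100,000 vehicles": pct_change(f0 / v0 * 100_000, f1 / v1 * 100_000),
        "per million miles": pct_change(f0 / m0, f1 / m1),
    }


for name, history in states.items():
    for metric, value in changes(history).items():
        print(f"{name}  {metric:<22} {value:+6.1f}%")
```

With these assumed numbers, State A looks worse on raw fatalities (up 20% against State B's 20% drop), yet it improves more than State B on both normalized measures because its vehicle count and mileage grew much faster. Which of the three is the right measure depends on the question being asked.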

And Kara, you know, I'm not sure what kinds of cars they are looking at, but that is another good consideration to keep in mind. Now we are gonna talk about color design. I love colors. I don't know, I just love all the colors of the rainbow, and it's really exciting to start to put some colors into my visualizations. But as we are putting together our designs, we do want to be making smart color decisions. So if you've got a line chart or an area chart of 50 states and you throw the color bucket onto there, your chart is gonna be impossible to read. Nothing is gonna be helpful, and you're not getting any additional value out of adding color to a visualization with this many data points. Typically, if you have more than, I wanna say, four or five categories, color is just gonna confuse your audience and make it too challenging to read. If you have to go back and forth between, you know, reading your legend and reading your chart, it's gonna be too much work for your audience.

One way to get around this, first off, is not showing as much data at once and not showing as many categories at once. But another way to get around it is to be selective with color and really highlight the most important points, as opposed to just coloring everything on there. So one of the methods that I personally have been using quite a bit is I use one color to highlight my most important data point, whether that's a state that I allow my audience to switch between or, you know, my company compared to others, and I put the other data points in grayscale. And so this is a good way to show that you're highlighting the data that should be the most important to your audience, while also giving that background data for context. In this case, you can see I highlighted in pink my company or line or state, and I've got those other companies in another color. I've also been starting to use this to show year-over-year trends. I might highlight the current year in one color and then put past years in grayscale, so that context is still there without distracting my audience as much. Yeah, it does look like sand art.
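The highlight-one-series idea boils down to a very small color-assignment rule. The sketch below is tool-agnostic and uses invented company names and arbitrarily chosen hex codes; the resulting mapping could be handed to whatever charting tool is in use.

```python
HIGHLIGHT = "#e7298a"  # an arbitrary pink; any single strong color would do
MUTED = "#bdbdbd"      # light gray for the context series


def assign_colors(series_names, focus):
    """Map every series to gray except the one the audience should look at."""
    if focus not in series_names:
        raise ValueError(f"{focus!r} is not one of the series")
    return {name: (HIGHLIGHT if name == focus else MUTED) for name in series_names}


# Hypothetical company names for illustration.
names = ["My Company", "Competitor A", "Competitor B", "Competitor C"]
colors = assign_colors(names, "My Company")
for name, color in colors.items():
    print(f"{name:<14} {color}")
```

The same function covers the year-over-year case: pass the list of years and use the current year as `focus`.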

But another thing that we want to consider as we are putting color onto our visualizations is making sure that they are accessible to the wider audience. So a good portion of the population, I think it's mostly men, but a lot of people are red-green color blind. And a request that I see for a lot of dashboards that I personally build is, hey, can we show red being bad and green being good? But if you do this, your entire audience may not actually be able to distinguish the difference between the two. So you can see in the heat map here, the visual on the left shows a red and green heat map, whereas the visual on the right shows what somebody with color blindness would see as a result. And it's really challenging to distinguish the difference between the two. The number one recommendation for, you know, tackling this is to not use red and green. Tableau is my tool of choice, and Tableau's default colors are actually blue and orange, because they are more easily distinguishable by people who can't necessarily see the full range of colors.

But let's say maybe your boss or your customer, for whatever reason, really, really, really wants red and green on their dashboard. There are some other ways you can tackle this. So let's say we have the visual on the left, where we want a green indicator for increases and a red indicator for decreases. One way that we could tackle this is to use the shades or the hues to distinguish between the two, so like a really light green and a really dark red, but that's not my personal favorite. What I would actually prefer is to incorporate something other than color, such as shape, to distinguish between your two metrics. So in the second example here, we have a green arrow showing the increase and a red arrow showing the decrease. If all of these symbols and colors are making your visualization really busy, maybe you only want to highlight the things that you need to pay the most attention to. So maybe I will just put a red X on the data point where I really want my audience to look. So there are some ways where you can get around showing red and green while still making sure that your visuals are as accessible as possible. So keep in mind that you should consider how color impacts your visual.
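One way to sketch the "shape as well as color" idea is a label that carries an arrow and an explicit sign, so the direction of change is still readable if the red and green are indistinguishable. `change_label` is a hypothetical helper, and the choice of symbols is an assumption.

```python
def change_label(previous, current):
    """Describe a change with an arrow and a sign, so color is not the only cue."""
    if previous <= 0:
        raise ValueError("previous must be positive to compute a percent change")
    pct = 100 * (current - previous) / previous
    if pct > 0:
        arrow = "▲"
    elif pct < 0:
        arrow = "▼"
    else:
        arrow = "■"
    return f"{arrow} {pct:+.1f}%"


print(change_label(100, 112))
print(change_label(100, 95))
print(change_label(100, 100))
```

A dashboard could still color the increase green and the decrease red, since the arrow and sign now carry the same information redundantly.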

Don't just throw random colors on there. Really think about how you can use color to enhance as opposed to distract. Next, uncertainty. So what do we do when we have missing data? If you are missing data in your data set, you've got a couple of options for how you can handle that. I mean, if it's categorical data, so you have a list of categories, maybe you exclude that category, because zero implies that there's, I don't know, no sales, no numbers there. But if you have something like a time series, you might need to come up with some other solutions. Let's say I have the visual on the left, where we have our number of rides over time, January through April. But for some reason, I don't have February rides. You have a few options for how you can handle this. You could default it to zero, which a lot of systems that collect data will typically do if you don't put any numbers in there. You could average the previous and the next point, so taking that January and March and seeing where the line might approximately be. Similarly, you could take the best-fit point, so drawing a line of best fit and guessing about where February would be. Or you can leave the value as null, just completely leaving it out.
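The four options for the missing February value can be written out side by side. The ride counts below are invented, and the helpers `neighbor_average` and `best_fit_estimate` are hypothetical names for a plain average of the two neighbors and an ordinary least-squares line through the known months.

```python
# Hypothetical monthly ride counts; February (month 2) was never recorded.
rides = {1: 100, 2: None, 3: 140, 4: 150}


def neighbor_average(data, month):
    """Average of the months immediately before and after the gap."""
    return (data[month - 1] + data[month + 1]) / 2


def best_fit_estimate(data, month):
    """Value at `month` on the least-squares line through the known points."""
    known = [(m, v) for m, v in data.items() if v is not None]
    n = len(known)
    mean_x = sum(m for m, _ in known) / n
    mean_y = sum(v for _, v in known) / n
    sxy = sum((m - mean_x) * (v - mean_y) for m, v in known)
    sxx = sum((m - mean_x) ** 2 for m, _ in known)
    return mean_y + (sxy / sxx) * (month - mean_x)


options = {
    "default to zero": 0,
    "average of neighbors": neighbor_average(rides, 2),
    "line of best fit": best_fit_estimate(rides, 2),
    "leave as null": None,
}
for label, value in options.items():
    shown = "no value" if value is None else f"{value:.1f}"
    print(f"{label:<22} {shown}")
```

With these numbers the two estimates land close together (120 and roughly 118.6), while defaulting to zero would draw a dip that never happened. Whichever option is used, it helps to mark the point as estimated or missing so the audience is not misled.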

Now again, there's no right answer. It really just depends on what your data is and who you're presenting the data to. If I am just talking to my manager and the purpose of the visual is to highlight data issues, maybe I would leave it null. If, I don't know, my VP is interested in trends for the remainder of the year, maybe in that type of case an estimation would be OK. It really just depends on your audience. But, like I mentioned, there are some options that you've got. And from Cairo again, here is another interesting way of showing those missing values. We mentioned that, you know, sometimes you can just exclude values, but if it is a time series, you can't just leave out a couple of months. And so the visual on the far left is not ideal, because your audience is going to be expecting those months there. The middle visual is a little bit better, because it shows, hey, there are a couple of months in here, but we don't necessarily have that data. But the third example is actually really interesting, because it estimates where the data might have been, but then calls out that it is missing. And so it allows your users to say, hey, this is about where those months might have fallen.

But we do acknowledge that that data isn't there. So that's just another option. And the final option we've got for uncertainty: if maybe your data points are truly estimations and the data could actually fall into a range, you can show this visually in your charts. The chart on the left is kind of like a box-and-whisker plot, but it has that minimum and maximum, so that your audience understands that your data points could fall into a range. And the shading in the second chart is actually really interesting, because it allows your audience to see, oh, the data could fall into a range.

It's an estimation, and it helps to be, you know, a better representation of what is going on with your data set. So the biggest tip here is, when you are handling uncertainty, you wanna choose a method that provides the most clarity to your audience. And maybe that also involves adding a footnote or adding some additional context around your chart as to what is really going on. Now, uncertainty is a little bit similar to data quality, because sometimes your data quality issue could be missing data.

But what I really wanted to talk about, as far as data quality goes, is understanding the sources of your data. This is really important if you are the data visualization designer, but also as you are reading data visualizations put together by other people. When you are looking at a visualization, there are a few items that you want to consider. Is this author experienced and reliable? Do they understand the data that they're working with? And can you trust them to, you know, visualize it in the most correct way possible? What organization created this visual? Could the author have an ulterior motive? As we talked about with aggregations, the type of aggregation method we choose tells a slightly different story. So could the author be sharing this particular story for a certain reason?

And did the data come from a reputable source? Can you actually trust that the data was collected correctly, accurately, and fairly? Now, this visualization on the left is plotting out the start of COVID in the United States, taken from March, and it is actually a visualization that I put together. Now, I will tell you I am not an expert on pandemic data, not an expert on health data. I just saw the cool data set online and thought it'd be interesting to visualize. So when we talk about whether the data came from a reputable source, it's Johns Hopkins data. I do trust the university to collect accurate information. But we wanna talk about the author, me. I put together my COVID map as the pandemic hit, so like mid-March, and published it online and shared it with all my friends, because I was so proud I put together something cool, and essentially left it alone. I revisited my data visualization about two weeks later and I realized I had not aggregated my data correctly. I had taken about, like, 30 seconds to look at the data source and didn't realize that some rows were duplicated, just because of how the data was structured. So in my visualization, I was showing like three times as many cases in these different areas as were actually there.
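The duplicated-rows mistake is easy to reproduce in miniature. The rows below are invented, and the talk does not say exactly how the source was structured, so treating the duplicates as exact repeated rows is an assumption made for this sketch.

```python
# Hypothetical rows: (region, date, cases). Each record appears three times,
# loosely imitating the kind of duplication described in the talk.
rows = [
    ("Region A", "2020-03-15", 40),
    ("Region B", "2020-03-15", 10),
] * 3


def total_cases(records):
    """Sum the cases column over whatever rows are passed in."""
    return sum(cases for _, _, cases in records)


def deduplicate(records):
    """Keep the first occurrence of each identical row, preserving order."""
    return list(dict.fromkeys(records))


print("Rows in source:   ", len(rows))
print("Distinct rows:    ", len(deduplicate(rows)))
print("Naive total:      ", total_cases(rows))
print("Total after dedup:", total_cases(deduplicate(rows)))
```

Comparing the row count with the distinct-row count before aggregating is a cheap check that would flag this kind of threefold inflation. In a real data set the duplicates may differ in some columns, in which case the key used for deduplication has to be chosen with an understanding of what each row represents.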

So I mean, I am personally ashamed to have contributed to false information about the pandemic, but maybe I am not the most trustworthy author when it comes to this type of data. And so that's something that you want to consider as you're looking at visualizations: does the person putting together this visual really understand not only the data source, but what the data points mean and even the data structure? That's something that you do want to keep in mind.

And especially as you are putting together your visualizations, you wanna make sure that you fully understand your data before drawing these conclusions and sharing them with the wider audience. So I mean, I don't think I had an ulterior motive, but in this instance, I would not have been the most reliable party to get COVID data from. So really you just want to question who your author is. And if you are the person putting together the visualization, you do want to volunteer details on how you created the visual ahead of time. Your audience is gonna have questions about who made it, what organization they are from, where they got the data from, and how the data was collected. And so if you are designing the dashboard, you do want to include that information somewhere, whether it's on your dashboard directly or in a footnote on your chart. Your audience is gonna want to know that they can trust you, so you do wanna make sure that you can provide those details if possible. But those are all of our tips for today. So in summary, this is the, you know, one slide you want to screenshot for today.

We talked about chart types and how your chart type can influence what your audience sees. We talked about conclusions and how correlation does not equal causation. Aggregations: if you aggregated it differently, would your story change? We talked about color and how you don't want to distract your audience.

You want to use it to enhance your visualizations. We talked about uncertainty and, you know, showing missing data. And we also talked about data quality, essentially questioning who the author is and questioning if you can trust your data source. So a lot of great stuff, a lot of common data visualization mistakes out there, but I'm hoping that there are at least one or two tips that you can take away and consider as you're putting together your next visualization. So thank you so much for joining me today. I really appreciate that you chose, you know, my session to come to. If you are interested in talking more about data visualization, or you want some links to a couple of resources, definitely feel free to reach out to me. You can find me on Twitter, LinkedIn, or email. I'm pretty good at responding to those things. So thank you so much again, and I will quickly look over the chat. I don't see anything in the Q&A, but if you do have any questions, I've got a couple of minutes left and I am happy to take those now, and I'm gonna scroll up to make sure I didn't miss anything. Let's see. I mean, I'm just making myself hungry for ice cream right now, is what I'm getting out of scrolling up here. Sand art. Yeah. Color blindness, I know. Yeah, color blindness is so important to keep in mind.

And I mean, it's not something that a lot of people typically think about when they're putting together visualizations. So that's why I always think it's important to talk about it. Yeah. And yeah, there are different software tools out there where you can upload a visual and see how it looks to, you know, people with different types of color blindness. Let's see. Well, thank you all so much for coming. All right. Well, it doesn't look like we have very many questions. So I will just say thank you again so much for joining, and I hope you all have a wonderful afternoon or morning, depending on where you're at. But thanks again.