Decoding Statistics in SQL Server


Video Transcription

Thank you so much for taking your valuable time to attend my session today. My name is Dhi Gouri. I'm a SQL Server professional, I've been working with SQL Server for several years now, and I'm a public speaker. Today I would like to present on the topic Decoding Statistics in SQL Server. This is my second session back to back, so thank you so much, Anna, for giving me this wonderful opportunity to present for the UCH conference. It's an honor, really. A little bit about me: I'm a database administrator with seven years of experience managing SQL Server and databases. I'm a Microsoft Certified Trainer and Microsoft Certified Solutions Expert. I did my master's degree at Eastern Illinois University, and I blog at dbanuggets.com.

That website is especially about SQL Server. I blog regularly, and I am very active on Twitter. I speak a lot at different conferences - 99% of them free conferences, and I have never spoken at any paid conference. So yeah, I'm like a freelancer: I do free stuff, I help professionals across the world by training them and speaking at different conferences. dbanuggets is my user ID, and if you need any help - if you're looking for jobs on SQL Server, if you want to know about SQL Server, if you want to learn anything or get the learning resources - please do contact me. I will provide you with all the resources to get started, and that's my email address as well. I did not mention a couple of things here: I'm a co-organizer for the Microsoft Data and AI South Florida user group, I'm a co-organizer for the Microsoft Data Equity, Inclusion and Diversity user group, and I'm a volunteer for the Microsoft Data Women in Technology user group as well. If you would love to join our groups, just ping me and I'll give you the links to join them; we are very active user groups.

These are free training providers for professionals across the world. Mostly these are virtual groups, so every month we get together and train people for free. Microsoft has taken the initiative to build all these wonderful user groups helping professionals across the world. I am also a committee member for the Professional Association for SQL Server - the PASS Summit. If you know the PASS Summit, it's the number one conference especially focused on SQL Server, and I'm a committee member for that as well. So what are we going to discuss in this session today? If you do not know about statistics, this is the session for you. We will go through an overview of what statistics are, how to view statistics, what happens when they go wrong - SQL Server completely depends upon these statistics - and how to update them. So, statistics: what is it all about? Statistics are blob structures saved inside the system tables that describe the distribution of the data inside a column, a set of columns, or an indexed view.

Basically, the query optimizer will use and trust these statistics to determine how many rows it thinks are going to come back out of a table or a column, which is also known as the cardinality estimate. You might be thinking: so what, why do we care about these statistics? If your statistics are accurate, that helps the optimizer generate good-enough plans for your queries. If they go really wrong - if they are bad, if we did not update them properly and they are out of date - that will cause some serious issues for your queries, and you won't even know why your queries are running so slow. If your estimates are way off compared with the actuals, the optimizer will choose non-optimal plans for your queries. The first time you run any query, the execution plan gets generated, and that plan is saved inside the plan cache. So if your estimates are way off, a wrong plan is selected and placed inside the plan cache, and the next time someone runs the same query they will reuse that plan, which was created from wrong estimates. So keeping the estimates up to date and keeping an eye on our statistics is important.
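As a quick way to see the gap between estimated and actual row counts being described here, you can run a query with the runtime profile turned on. This is a minimal sketch, assuming an AdventureWorks-style table and predicate that are not from the demo:

    -- Returns one row per plan operator, including both Rows (actual) and EstimateRows columns
    SET STATISTICS PROFILE ON;

    SELECT SalesOrderID, OrderDate
    FROM Sales.SalesOrderHeader          -- hypothetical table
    WHERE CustomerID = 11000;            -- hypothetical predicate

    SET STATISTICS PROFILE OFF;

Comparing the EstimateRows column against Rows for each operator is exactly the estimates-versus-actuals check the talk keeps coming back to.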

So why should we care? Like I said, if the estimates are way off, you get slow response times due to the inefficient plans that get stored inside the plan cache, and excessive resource utilization - CPU, memory, and IO. Because of these wrong estimates, even the operators inside the execution plan, like the join operators or seek operators, go wrong: instead of doing a seek it will do a scan, or other things break in between.

They become non-optimal. If your estimates are way off - for example, if SQL Server estimates, "I think only four rows are coming out of this," but when you execute the query 4 million rows come back - those estimates are way too off. Based upon these estimates, memory grants are allocated to the query even before it goes into the execution phase. So whenever you run any query, before the execution phase it gets its memory grant depending upon these statistics. Once the query goes into the execution phase and 4 million rows start coming, SQL Server cannot do anything about it; it cannot just go ahead and allocate more memory for that query right away. So what does it have to do? It has to spill to disk. That means it will go ahead and use tempdb, which is a terrible idea. All of this happens because of the wrong estimates - our memory grants depend upon these estimates.
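If you want to watch the memory-grant behavior described above while queries are running, one option is the memory grants DMV. A minimal sketch, with the final filter only there to exclude the monitoring session itself:

    -- One row per query currently holding or waiting for a memory grant
    SELECT session_id,
           requested_memory_kb,
           granted_memory_kb,
           used_memory_kb,
           ideal_memory_kb
    FROM sys.dm_exec_query_memory_grants
    WHERE session_id <> @@SPID;

If the granted memory is far smaller than what the query actually needs, you see spills to tempdb; if it is far larger than used_memory_kb, the estimates were too high and memory is being wasted.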

So we have to make sure these estimates are accurate; wrong estimates also reduce throughput and concurrency. It is so important to keep these statistics up to date. Like I said, the query optimizer depends upon the statistics to choose all the operators inside your query plans: what structure should it use - the clustered index or a non-clustered index? Should it do a seek or a scan? Are any lookups needed? What kind of join should it use - hash join, nested loops, or merge join? Should the query go serial or parallel? Should it use batch mode or row mode execution? All of these decisions depend upon the statistics, so if your statistics go bad, all of these downstream and upstream operations go terribly bad and a non-optimal plan gets generated. So what could go wrong? There are several things that can go wrong.

Whenever the query goes into the optimization phase, it first gets parsed and then goes through several transformation phases. The optimizer checks the cardinality estimates - how many rows it thinks are coming back, and whether that is accurate. As those transformations happen, several query trees get generated; for each of them the cardinality estimates have to be checked, and for each query tree a cost is calculated. The lowest-cost query tree is the winner, and that query plan is saved inside the plan cache. Once the query goes to the execution phase, like I said, the plan saved inside the plan cache is used, and the memory grant is allocated based upon those cardinality estimates; then the query executes and gives us the results. But if the cardinality estimates are wrong, the memory grant that gets allocated is also wrong, and that means the cost is suspect too. So just because of one cardinality estimate going wrong, all of these operations go wrong, a non-optimal plan gets generated, and that causes all the issues.

So let's dive into the database level. Until now we have seen the effects and why we need to make sure the statistics are up to date; now let's look at where these options live. If you go to the properties of the database - right-click on the database, go to Properties - you see a couple of options especially for statistics. Auto Create Statistics is turned on by default; please make sure it stays turned on and do not turn it off, because it is very important that SQL Server can create statistics on a column when it needs to. If you run a query and you're searching on a column for the first time, SQL Server thinks, "Hmm, it would be a good idea to go ahead and create a statistic, because this query might be run again," and it creates the statistics on that particular column. So it is important to keep this set to true, which it is by default. The second option is Auto Update Statistics: as you insert, update, and delete data and cross a certain threshold, the statistics become outdated, and this option lets SQL Server update them. So please, please do not turn this off - only in certain specific scenarios might you think of turning it off.
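A minimal sketch of checking and setting those two options from T-SQL rather than the Properties dialog; the database name is just the AdventureWorks sample used later in the demo, not something you must match:

    -- 1 = on for both flags
    SELECT name,
           is_auto_create_stats_on,
           is_auto_update_stats_on
    FROM sys.databases
    WHERE name = N'AdventureWorks2016';

    -- Keep the defaults enabled
    ALTER DATABASE AdventureWorks2016 SET AUTO_CREATE_STATISTICS ON;
    ALTER DATABASE AdventureWorks2016 SET AUTO_UPDATE_STATISTICS ON;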

But 90% of the time you want this turned on, because then SQL Server has the ability to go ahead and update outdated stats automatically behind the scenes, even without your knowledge - that's very important. The third option is Auto Update Statistics Asynchronously. This particular option is turned off by default, and we are going to talk about it later in the slides. The fourth option is Auto Create Incremental Statistics. This is turned off by default; only if you have partitioning on your tables might you think about enabling it, and other than that you can just leave it as false. We are not focusing on that one; we are focusing on the others, auto create and auto update stats, because those are the important ones. So there are many ways stats get created - manually or automatically. A statistic gets created automatically whenever you create an index: if you create a clustered or non-clustered index, a statistic automatically gets created with the exact same name as the index. But the two are different - the stats and the index are completely different things; the index actually stores the data itself.

The stats just give you information about how many rows SQL Server thinks are going to come back - the latest information as of whenever the stats were updated. It's not the exact data; the data does not get stored inside the stats. The index is where the data gets stored. So for every index you create, every filtered index you create, a statistic with the same name gets created behind the scenes. For example, I'm going ahead and creating a non-clustered index here on the TerritoryID column. When I create it, if you go and check the stats by using the sys.stats view, you can see on the right side of your screen - the highlighted one - that a statistic with the exact same name gets created on the exact same column, TerritoryID. That gets created automatically.
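A minimal sketch of that demo step - creating the index and then listing the statistics on the table. The object and index names follow the AdventureWorks sample but are assumptions, not taken verbatim from the demo:

    -- Creating the index also creates a statistic with the same name
    CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_TerritoryID
        ON Sales.SalesOrderHeader (TerritoryID);

    -- List every statistic on the table; the new one shows up under the index's name
    SELECT s.name AS stats_name,
           s.auto_created,
           s.user_created
    FROM sys.stats AS s
    WHERE s.object_id = OBJECT_ID(N'Sales.SalesOrderHeader');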

And for columns: if you're searching on a column - here we are doing it on the Color column - then a statistic gets created too, but this time with a different name. It's a system-generated statistic. How do we read that name? The _WA stands for Washington State, _Sys means system-generated, _00000006 represents the column id inside your table - the Color column's column id is 6 - and the last part, something like 08B5..., is a hexadecimal value derived from your table's object id.
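If you'd rather not decode the _WA_Sys_... name by hand, a small lookup like this maps each auto-created statistic back to its column - a sketch assuming the AdventureWorks Production.Product table, which is where the Color column lives in that sample:

    -- Resolve which column each system-generated statistic covers
    SELECT s.name  AS stats_name,
           c.name  AS column_name,
           c.column_id
    FROM sys.stats AS s
    JOIN sys.stats_columns AS sc
        ON sc.object_id = s.object_id
       AND sc.stats_id  = s.stats_id
    JOIN sys.columns AS c
        ON c.object_id = sc.object_id
       AND c.column_id = sc.column_id
    WHERE s.object_id = OBJECT_ID(N'Production.Product')
      AND s.auto_created = 1;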

That's how we read it. Whenever you search on a column, a statistic automatically gets generated on that particular column - and the same thing happens when you join on a column. This time I'm joining on the ListPrice column, and when you check the sys.stats view, a system-generated statistic has been created on the ListPrice column, exactly as before. We can also create stats manually - it's not only automatic - on multiple columns, on single columns, or even on filtered values in a column. Now, how do we view the stats? Like I said, by using the sys.stats view you can get the list of all the stats on a table, and to view an individual statistic in detail you can use the DBCC command DBCC SHOW_STATISTICS. You see three kinds of information for that statistic: the header, the density vector, and the histogram. We are going to see these in the demo. And for the later versions of SQL Server we have two functions we can use: for the header information you can use sys.dm_db_stats_properties, and for the histogram information you can use sys.dm_db_stats_histogram.
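Here is a minimal sketch of those three commands side by side. The table name and the stats_id of 2 are assumptions for illustration, not values from the demo:

    -- Header, density vector, and histogram in one shot
    DBCC SHOW_STATISTICS (N'Sales.SalesOrderHeader', IX_SalesOrderHeader_TerritoryID);

    -- Header information only (newer versions of SQL Server)
    SELECT *
    FROM sys.dm_db_stats_properties(OBJECT_ID(N'Sales.SalesOrderHeader'), 2);

    -- Histogram only (newer versions of SQL Server)
    SELECT *
    FROM sys.dm_db_stats_histogram(OBJECT_ID(N'Sales.SalesOrderHeader'), 2);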

But my favorite is DBCC SHOW_STATISTICS, because in a single shot I can see all the information about that particular statistic. This session is more demo-driven, so let's go ahead and see my demo inside Management Studio. I hope you are seeing my Management Studio - can anybody confirm in the chat that you are able to see it? Yes? That helps, thank you. So I'm using the AdventureWorks2016 database, which is provided for SQL Server demo and testing purposes; let me go ahead and use that database. Like I said, to view all the indexes on a particular table - we have the SalesOrderHeader table; I added a bunch of data into it, which is why I named it Large. I'm executing that, and it shows the index name, index description, and index keys. Now, while I'm talking about statistics, why did I show you the indexes? Because we are mainly focusing on this particular non-clustered index: I went ahead and created a non-clustered index on SalesPersonID. Did I do it correctly? Let me see. Yes, on SalesPersonID.

When I did that, it also went ahead and created a statistic on that particular column. We can see the primary-key related stats, the non-clustered index stats, and which columns have system-generated stats created on them - sp_helpstats will help with that. It is actually deprecated, but until they completely remove it we can still use it, because it provides nice information like the stats name and which column each statistic got created on. Now, let me go ahead and run DBCC SHOW_STATISTICS on this particular statistic - provide the table name and then the stats name. "Could not locate..." - hmm. Demo, demo, broken demo. Let me go ahead and see why this broke. Come on, SQL Server, you cannot do this to me right now. Oh my goodness, why did my demo break? Basically what it is saying is that... maybe I'll create this index, which will actually create the statistic. Did I name that CREATE INDEX correctly? OK, something bad happened. Let me go ahead and use this other statistic instead, just to show you how the stats look.

I didn't expect my demo to break, but that's completely fine. Let me show you how the index stats look. Oh my goodness, what happened? OK, let me check the system catalogs and see my stats. Right now we do have all of these stats, so let me copy the SalesPersonID and show you the stats on this particular SalesPersonID statistic. Now we are able to see it. So, to see the information about any particular statistic, let's go ahead and run DBCC SHOW_STATISTICS, providing the table name and then the stats name, and we will get the three kinds of information: the header, the density vector, and the histogram.

Inside the header information we have the name of the statistic, when the stats were last updated, how many rows were in the table at the time of the update, and how many of those rows were sampled. So out of this many rows, only this many were sampled - that means it's not a 100% sample, as you can see here. And how many steps do we have? There are 18 steps here, but the maximum is 200 steps, plus one additional step, which makes 201; that additional step is for any NULL values in your column. Then we can see whether this particular statistic is filtered, and how many unfiltered rows there are; this one is not filtered, which is why we see 100% unfiltered rows. And when you look at the density vector - how is this density value calculated? One divided by the total number of unique values inside the column gives us the density value. So the smaller the value you see here, the more unique your column is.
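A concrete illustration of that density formula. The command mirrors the demo, but the statistic name and the distinct count of 290 are assumptions, not values read from the demo output:

    -- Show only the density vector section for one statistic
    DBCC SHOW_STATISTICS (N'Sales.SalesOrderHeader', IX_SalesOrderHeader_SalesPersonID)
        WITH DENSITY_VECTOR;

    -- "All density" is simply 1 / (number of distinct values);
    -- e.g. 290 distinct values would give 1.0 / 290 = 0.00344827...
    SELECT 1.0 / 290 AS example_density;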

If you see something like 0.67, that means the column is not very unique; if you see 0.000-something, that means it's pretty unique. And how do we read the histogram information? The first column is RANGE_HI_KEY. RANGE_HI_KEY means we have the boundary values of this SalesPersonID column - not each and every value can fit in as a RANGE_HI_KEY, because if we stored every single value we could not fit within 200 steps. That is the reason we only have a sample of values here. And if you look at the range rows, it shows me zero because my demo broke and I used a different column. So, RANGE_ROWS - what it actually shows is... let me show you a different statistic so that maybe we'll see some values in it. Maybe SalesOrderID - let's go ahead and see that one instead of CustomerID; I hope it will show. OK, right. So we have switched to a different statistic just to show you the range rows. Here, a single histogram step cannot store every value in its range.

Now, if you see here, we have 194 steps, whereas previously we just had a handful, around 16, because there isn't much data in that particular statistic. But here we do have a lot of data, as you can see from the rows and rows sampled. So how do we read the range? Say I'm searching on a SalesOrderID value of 61,688. Before going into the range rows, let's look at the equal rows: for the SalesOrderID value of 61,688 we have one equal row - for this particular value, EQ_ROWS is one. But if we are searching on a SalesOrderID value that falls in between two steps, how will SQL Server estimate that? There has to be a way to estimate that value. So it will go ahead and use the RANGE_ROWS value - RANGE_ROWS is nothing but how many rows fall in between the previous boundary value and the current one. And the DISTINCT_RANGE_ROWS column is calculated as how many unique values fall in between those two boundaries; here we have 510.

So if we are searching for any value in between these two steps - let's say we are searching on SalesOrderID 61,689, which falls in between - SQL Server does not have an exact step to read the estimate from. So it will go ahead and divide the RANGE_ROWS value by the DISTINCT_RANGE_ROWS, which in this case gives 23.15 (a worked version of this arithmetic is sketched right after this paragraph). That's how SQL Server actually calculates the estimate for values inside a range. Now, let me go ahead and see if we can do the other demo. As I told you, you can also get the header information by using sys.dm_db_stats_properties; if you run that... hmm, the object id is completely different. Let me see the object id. This is what happens when your demo breaks in the middle of a session, right? Let me see if this query runs. I did not expect this - I checked everything, trust me, I checked everything, and I did not expect it to break today. The problem is that my laptop shut down, so I had to immediately take a new laptop and set up all my demos, and they did not work properly. That's what happened. OK, so we don't have the stats on this particular CustomerID value.
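The worked arithmetic for that in-between-steps estimate. The DISTINCT_RANGE_ROWS of 510 and the result of 23.15 come from the demo; the RANGE_ROWS figure is back-calculated from them (510 * 23.15 is roughly 11,807), so treat it as illustrative:

    -- AVG_RANGE_ROWS = RANGE_ROWS / DISTINCT_RANGE_ROWS is the estimate used for
    -- any value that falls between two histogram steps
    SELECT 11807.0 / 510 AS estimated_rows_for_in_between_value;   -- ~23.15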

But since my demo broke, here is what I can do: I have a wonderful session that I already presented and uploaded to YouTube - a conference talk, about a 1.5-hour session, on exactly this same topic of statistics. If you are all OK with it - because my demo broke; I tried my level best to set up my demo, but I don't know what happened - I can still give you the video of the complete 1.5-hour session. Let me go ahead and share that right away, because we will still do the slide part; just because my demo broke doesn't mean the presentation broke. So let me find "Understanding Statistics in SQL Server," which I gave for a user group session. This is the session I'm talking about; let me share it with you. And I'm really sorry, guys, I didn't mean to break my demo. With that session - the same exact session - you will not miss anything.

In fact, you will get additional information - deep-dive content on statistics. So let me go ahead and share my slides again. We have something known as a threshold value. If auto create statistics and auto update statistics are enabled - which they are by default - SQL Server will go ahead and automatically update the stats that become outdated. But for that auto update to kick in, the columns covered by that particular statistic have to change: at least 20% of the rows plus 500 rows have to be modified for that statistic to be marked as outdated. Once it is outdated, the next time a user runs a query that uses the outdated statistic, the auto update kicks in. Until then - if you have an outdated statistic and nobody is touching it, nobody is running any query that uses it - auto update stats does not kick in; only the next time a user runs a query against that outdated statistic does the auto update happen. So that is the threshold: 20% of the rows plus 500 rows have to change.

That particular threshold is the old threshold, used before SQL Server 2016. For SQL Server 2016 and the later versions, there is a new threshold: the square root of 1,000 times the number of rows in the table. That is much, much smaller - where the old threshold might require tens of thousands of rows to change, the new threshold needs only a few thousand changed rows for the statistic to be considered outdated.
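A quick worked comparison of the two thresholds; the 1,000,000-row table size is just an example, not a figure from the talk:

    -- Old threshold: 20% of the rows + 500.  New threshold (2016+): SQRT(1000 * rows).
    SELECT 0.20 * 1000000 + 500     AS old_threshold_rows,   -- 200,500
           SQRT(1000.0 * 1000000)   AS new_threshold_rows;   -- ~31,623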

Then, the next time a user runs any query that uses the outdated statistic, auto update stats kicks in automatically and does whatever it has to do behind the scenes. I told you I would talk about updating stats asynchronously. This particular option is turned off by default - so why is it useful? If auto update stats is turned on, that beautiful automated system, SQL Server, will go ahead and update the stats nicely behind the scenes. But your users have to wait until those stats get updated: the next time a user runs a query against the outdated statistic, behind the scenes the stats get updated, the old execution plan is invalidated, and a new plan is generated using the new statistics information. While all of that is happening, the user has to wait and will not see any results, because SQL Server has to update the stats, invalidate the existing execution plan, create a new execution plan, and only then use that newly created plan for the query.

Just imagine if that takes a long time - the user has to wait a long time. Usually, 90% of customers do not have any problem while this is happening, but if you are in an environment where the users do not want to wait even a second, in that particular scenario you can go ahead and turn on the Auto Update Statistics Asynchronously option. What happens when you turn this on? Once a statistic is invalidated and the next user runs a query that uses it, what you are saying to SQL Server is: "I know the statistic is out of date, and I know the user is running a query against that outdated statistic, but please do not update it immediately, because my user does not want to wait. Instead, run the user's query and give them the results using the same outdated statistic and the same old execution plan; once the results are on the screen, go ahead and update the stats afterward." I believe that happens right after the query runs and the results come back, because it all happens automatically behind the scenes without interrupting anything. In SQL Server 2019, if your user is waiting - if auto update statistics is turned on and this asynchronous option is turned off - you will see the wait type WAIT_ON_SYNC_STATISTICS_REFRESH while your user is waiting on the query and the stats are being updated. As I told you, when this option is enabled, the updates happen asynchronously.
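A minimal sketch of turning the asynchronous behavior on and verifying it; the database name is the AdventureWorks sample and is only an assumption, and AUTO_UPDATE_STATISTICS itself must stay ON for this setting to matter:

    -- Let queries keep using the old plan and stats while the refresh happens in the background
    ALTER DATABASE AdventureWorks2016 SET AUTO_UPDATE_STATISTICS_ASYNC ON;

    -- Verify the setting (1 = asynchronous updates enabled)
    SELECT name, is_auto_update_stats_async_on
    FROM sys.databases
    WHERE name = N'AdventureWorks2016';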

Even though your stats get updated automatically, you need to make sure you also maintain them with a maintenance plan. By default, we have the regular maintenance plans inside SQL Server, where we build packages for maintaining our statistics.

But the maintenance plans we create there do not have many options: all you can do is update the stats with a full scan or with the default sampling, and that's it. If you want more options for updating your stats, then the free maintenance scripts written by Ola Hallengren give you more choices for how you want to update them - for example, whether there are specific stats you want to update with a full scan, 100%. Those kinds of modifications you can do with Ola Hallengren's scripts; they are completely free, and you definitely want to check them out. And there are a couple more options you can use while updating stats: one is NORECOMPUTE and the other is PERSIST_SAMPLE_PERCENT. So what is NORECOMPUTE? With NORECOMPUTE you are telling SQL Server: even if a statistic is outdated, please do not go ahead and automatically update it in the middle of the business day; update it only when I run my own UPDATE STATISTICS with the NORECOMPUTE option. Until then, please do not update that statistic during the business day, even though it is out of date.
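A sketch of what those manual updates can look like; the object and statistic names are hypothetical, and NORECOMPUTE opts that statistic out of automatic updates until you update it again yourself:

    -- Full-scan update of every statistic on the table
    UPDATE STATISTICS Sales.SalesOrderHeader WITH FULLSCAN;

    -- Update one statistic and suppress automatic updates on it afterwards
    UPDATE STATISTICS Sales.SalesOrderHeader IX_SalesOrderHeader_SalesPersonID
        WITH FULLSCAN, NORECOMPUTE;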

The other option is PERSIST_SAMPLE_PERCENT. Whenever you use this option while updating the stats - say with a sample percent of 100% - you are telling SQL Server: even when you find that statistic outdated and a user is running a query against it, please do not go ahead and update it with the default sampling.

Instead, use the sample percentage I specified: if I said 100%, then go ahead and update the statistic with a 100% scan (there is a small sketch of this after the recap below). This particular option has been available since SQL Server 2016 SP1 CU4, SQL Server 2017 CU1, and the later versions of SQL Server. So, we have seen what statistics are and how to view them - at least I showed you a couple of things, like the histogram itself, which is the main thing we have to know - and what happens when they go wrong. When they go wrong, the memory grants allocated to the query also go wrong, and queries have to do their sorts and hashes by spilling to disk because the memory is not sufficient; that causes serious performance problems. And we have seen how to update these statistics by using maintenance plans: even though auto update stats is turned on, which it is by default, that alone is not enough.
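The sketch of persisting the sample rate mentioned above; the 100% sample mirrors the example in the talk, while the object and statistic names are assumptions:

    -- Future automatic updates of this statistic will reuse the 100% sample rate
    -- instead of falling back to the default sampling
    UPDATE STATISTICS Sales.SalesOrderHeader IX_SalesOrderHeader_SalesPersonID
        WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = ON;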

We have to make sure we are updating these stats regularly by using the maintenance plans. These are the resources I followed to prepare the presentation, and I would like to thank every one of you. I really apologize that my demo broke, but I provided you the link to the 1.5-hour session, which is an even deeper dive than this one, so I hope you will like it. Thank you so much for attending my session today. If you have any questions, just go ahead and ping me in the chat window. Anyone, any questions? Oh, thank you so much, Sinead - I'm sorry if I pronounced your name wrong - and you can get in touch with me on Twitter. I have several other presentations that I have done, all of them deep-dive content; if you would like those as well, you can ping me personally and I can provide you all the resources. I also talk about Azure - like the fundamentals of Azure SQL Database - if you want to learn about Azure; I'm giving some of those presentations at free conferences, and I can even provide you the registration links for all those free conferences so you can attend and get that wonderful free training. Yeah, thank you so much, Aishwarya.

You can go ahead and look at the YouTube video if you missed the session - here is the link again, I am typing it now. And guys, if you want to get in touch with me, DBA Nuggets is my user ID for Twitter and dbanuggets.com is my website; you can subscribe to my website and get those free articles related to SQL Server. I think I'm a bit early, but I would like to give the next speaker the opportunity to come in and get ready for their session. Thank you so much, and I hope to get in touch with you - if you would like to connect with me on LinkedIn, just type BP Gogoi in the search window and we'll get connected. Thank you so much. Bye.