DBPedias

Your Database Knowledge Community

Patrick O'Keeffe

  1. A Model for SQL Server Performance based on Waits

    Data, data everywhere…

    Project Lucy users that run Spotlight on SQL Server have been uploading SQL Server Waits data since May 2011. We now have hundreds of millions of samples of wait data from thousands of servers. We recently pushed an update to the Project Lucy home page that (amongst other things) displays a breakdown of the SQL Server population by hardware manufacturer type. This visualisation also splits the instances into two groups that display their performance ”health”. To do this we came up with a model that calculated the performance health of a SQL Server instance based on wait data.

    The Model

    The Nature of the data

    We are using wait statistics from SQL Server within this model.  A variant of this model is well known in Oracle circles and we wanted to see if it could be applied to SQL Server.  To do this we needed data, which is lucky because Spotlight on SQL Server Users who have opted in to uploading data to Project Lucy contribute 4 samples of wait stats data for each server that Spotlight monitors every day.  Since our users have been uploading this data since May 2011 we now have a large corpus to play with.

    Modelled Variables

    Currently the model evaluates three variables.

    ”Good” Waits

    CPU wait and I/O Wait – the presumption being that if you are waiting on these at least you are doing ”work”.  CPU wait is simply signal wait, which is a wait for CPU time after having waited on something else. I/O Wait consists of a wait on any of the following wait types – ASYNC_DISKPOOL_LOCK, ASYNC_IO_COMPLETION, CHKPT, DISKIO_SUSPEND, FCB_REPLICA_READ, FCB_REPLICA_WRITE, IO_COMPLETION, LOGBUFFER, LOGMGR, LOGMGR_FLUSH, LOGMGR_RESERVE_APPEND, PAGEIOLATCH_DT, PAGEIOLATCH_EX, PAGEIOLATCH_KP, PAGEIOLATCH_NL, PAGEIOLATCH_SH, PAGEIOLATCH_UP, SOAP_READ, SOAP_WRITE, WRITELOG

    ”Bad” Waits

    ”Bad” waits are the waits that tell you that you are waiting for something that is stopping you from actually doing any useful work. There are a number of categories of ”bad” wait:

    • Lock (LCK_M_*)
    • Latch (LATCH_*, PAGELATCH_*, TRAN_MARK_*, TRAN_MARKLATCH_*)
    • CLR (CLR_*, CLRHOST_*, SQLCLR_*)
    • Other wait types (too numerous to list)

    We do exclude Network (ASYNC_NETWORK_IO, NETWORKIO) from our model, since these are more indicative of client side issues.

    I/O Subsystem Response

    Databases are large data pumps.  They exist to get data into and out of permanent storage as fast as possible.  Performance is definitely going to suffer if the database cannot get pages on or off the disk fast enough, so we need to include in our model a measure of the responsiveness of the I/O subsystem.  We are using Single Page Read latency which is simply the average time per wait for PAGEIOLATCH_* waits.

    In English please…?

    Basically in simple terms we are positing that as long as your single page read latency is low and the majority of time spent waiting is spent waiting on CPU or IO (i.e. you”re actually doing work) then you have no performance bottlenecks.

    Doing the Math

    Now we are going to use our model to calculate a score for a SQL Server. Currently we are using a score of above 80 for everything is OK.  Let”s use as an example, a SQL Server instance that has a single page read latency of 24 ms and spends 50% of total wait time waiting on CPU and 40% of total waiting on IO.

    We start with a score of 100 for Single page read latency. This score is reduced by 20 for each increment of 5ms above 5ms.

    SinglePageReadLatencyvsScore.png

    In the case of our example, the score is 20 because our I/O Single Page Read latency is between 20 and 25.

    We start with a sore of 100 for CPU + IO.  For every increment of 5 that the sum of CPU and IO wait expressed as a percentage of total wait falls we subtract 20.

    CPUPlusIOvsScore.png

    Again in our case, we have figures of 50% and 40% respectively for a total of 90% for CPU + IO as a percentage of total wait.  90% falls into the 95-90 range so the second score becomes 80.

    These two scores are then equally weighted to produce a single score. (20 + 80) / 2 = 50.

    In this case we would argue that the performance of this SQL Server is not OK because even though it is doing work, it cannot get pages on or off the disk fast enough because the I/O subsystem is not fast enough.

     

    Results

    Running our algorithm over the whole corpus gave us the following distribution of performance scores.  As you can see the majority of SQL Servers in the population are OK in terms of performance.

    It is interesting to note the ”hump” around the 50 mark.  This leads us to think that our algorithm could use some refinement.  Perhaps single page read latency is less important when there is minimal I/O occurring, so the weighting should be less for this part of score under these conditions.  Feedback and discussion on this is welcome.

    PerformanceScoreDistribution.png

    Real world examples

    So what does a SQL Server with a ”good” performance score look like?  Here is an example.  This server got a score of 100. As you can see, the only wait is ”good” wait which indicates that the SQL Server is doing actual work and that there is no contention for other resources getting in the way.  In addition, the single page read latency is 1ms which means that there is no I/O subsystem bottleneck.

    WaitScoreDist100Badge.png

    What about a SQL Server with a ”bad” performance score?  Here is an example with a score of 30.  Note the large proportion of time spent waiting on latches.  Whilst this wait is occurring, the time available to do useful work has decreased.  Also note the large single page read latency – 22ms is an unacceptably large amount of time to wait for an I/O to occur.  This server clearly has contention issues and also I/O subsystem issues.

    WaitScoreDist30Badge.png

    Here is an example showing a performance score of 50.  This server has a low single page read latency which is good, but is spending a lot of time waiting on Locks and Latches.  It is possible that the I/O read latency is low because the contention on Locks and Latches means that not enough I/O is being done to saturate the I/O subsystem.

    WaitScoreDist50Badge.png

    In Conclusion

    This model seems to be a reasonable way to quickly gauge whether a SQL Server instance has any performance bottlenecks and what those bottlenecks are.  Note that this model does not make any judgement at this stage whether the work being done is ”useful”.  That is, even if there are no bottlenecks, the response time may not be acceptable because of application design issues.

    We welcome any comment or feedback you might have.

     

  2. Data: SQL Server Version Breakdown for Week Ending 9 September

    This is the latest SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and Spotlight on SQL Server uploaders and has a sample size of 960 servers – up from 890 this time last week.  No major changes, but out sample size is growing. We still see that SQL Server 2008 and SQL Server 2008 R2 represent over half the population followed by SQL Server 2005 representing nearly two fifths and with SQL Server 2000 bringing up the rear.

     

    SQLVersion09-09-11.png

    Here is this weeks plot of sample size over time:

    SQLVersionSampleSize09-09-2011.png

  3. Data: SQL Server Version Breakdown for Week Ending 2 September

    This is the latest SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and Spotlight on SQL Server uploaders and has a sample size of 890 servers – up from 355 this time last week.  The reason for this increse is that we made some changes that allowed us to consume version data from our Spotlight on SQL Server uploaders.  The picture of our population has changed somewhat with this new data.  We see that SQL Server 2008 and SQL Server 2008 R2 represent over half the population followed by SQL Server 2005 representing nearly two fifths and with SQL Server 2000 bringing up the rear.

    I will be interesting to see what the trend looks like in a few weeks.

     

    SQLVersion02-09-11.png

    Here is this weeks plot of sample size over time:

    SQLVersionSampleSize02-09-2011.png

  4. Data: SQL Server Version Breakdown for Week Ending 2 September

    This is the latest SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and Spotlight on SQL Server uploaders and has a sample size of 890 servers – up from 355 this time last week.  The reason for this increse is that we made some changes that allowed us to consume version data from our Spotlight on SQL Server uploaders.  The picture of our population has changed somewhat with this new data.  We see that SQL Server 2008 and SQL Server 2008 R2 represent over half the population followed by SQL Server 2005 representing nearly two fifths and with SQL Server 2000 bringing up the rear.

    I will be interesting to see what the trend looks like in a few weeks.

     

    SQLVersion02-09-11.png

    Here is this weeks plot of sample size over time:

    SQLVersionSampleSize02-09-2011.png

  5. Data: Distribution of IO Stall for All SQL Server Uploaders

    This is some more data that we thought you might find interesting.

    Have you ever wondered what I/O wait times SQL Server users were experiencing? Wonder no more – now we can begin to tell you.

    This is the distribution of I/O Stall times across all Project Lucy users.  This data is derived from users who use Spotlight on SQL Server to upload data to Project Lucy and the sample size is 102822 distinct data and log files.  This data shows that as we would expect, the majority of IO”s performed across our user population are waiting less than 2.5 milliseconds – 2.5 ms per I/O means 400 I/Os per second.  It would be interesting to split these numbers into log and data files, but we are not in a position to do this just yet (we will be able to in a few weeks).

     

    IOStallDist26-08-2011.png

    By way of explanation, the source of the data for this is the sys.dm_io_virtual_file_stats DMV. Spotlight collects the io_stall_read_ms and io_stall_write_ms columns. These are then divided by the num_of_reads and num_of_writes values respectively to get the average stall time per IO.

  6. Data: Distribution of IO Stall for All SQL Server Uploaders

    This is some more data that we thought you might find interesting.

    Have you ever wondered what I/O wait times SQL Server users were experiencing? Wonder no more – now we can begin to tell you.

    This is the distribution of I/O Stall times across all Project Lucy users.  This data is derived from users who use Spotlight on SQL Server to upload data to Project Lucy and the sample size is 102822 distinct data and log files.  This data shows that as we would expect, the majority of IO”s performed across our user population are waiting less than 2.5 milliseconds – 2.5 ms per I/O means 400 I/Os per second.  It would be interesting to split these numbers into log and data files, but we are not in a position to do this just yet (we will be able to in a few weeks).

     

    IOStallDist26-08-2011.png

    By way of explanation, the source of the data for this is the sys.dm_io_virtual_file_stats DMV. Spotlight collects the io_stall_read_ms and io_stall_write_ms columns. These are then divided by the num_of_reads and num_of_writes values respectively to get the average stall time per IO.

  7. Data: SQL Server Version Breakdown for Week Ending 26 August

    This is the latest SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and has a sample size of 355 servers – up from 352 this time last week.  We can see that the percentage of SQL Server 2008 is still rising and the percentage of SQL Server 2000 is falling – albeit slightly.

     

    SQLVersion26-08-11.png

    Here is this weeks plot of sample size over time:

    SQLVersionSampleSize26-08-2011.png

  8. Data: SQL Server Version Breakdown for Week Ending 26 August

    This is the latest SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and has a sample size of 355 servers – up from 352 this time last week.  We can see that the percentage of SQL Server 2008 is still rising and the percentage of SQL Server 2000 is falling – albeit slightly.

     

    SQLVersion26-08-11.png

    Here is this weeks plot of sample size over time:

    SQLVersionSampleSize26-08-2011.png

  9. Data: SQL Server Version Breakdown for Week Ending 19 August

    This is the third in a series of posts that publish some data that we find interesting.

    This is the SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and has a sample size of 352 servers – up from 341 this time last week.  We can see that the percentage of SQL Server 2008 is still rising and the percentage of SQL Server 2000 is falling – albeit slightly.

     

    SQLVersion19-08-11.png

    Here is this week”s plot of sample size over time:

    SQLVersionSampleSize19-08-2011.png

  10. Data: SQL Server Version Breakdown for Week Ending 19 August

    This is the third in a series of posts that publish some data that we find interesting.

    This is the SQL Server Version breakdown across all Project Lucy users.  This data is derived from users who uploaded trace files and has a sample size of 352 servers – up from 341 this time last week.  We can see that the percentage of SQL Server 2008 is still rising and the percentage of SQL Server 2000 is falling – albeit slightly.

     

    SQLVersion19-08-11.png

    Here is this week”s plot of sample size over time:

    SQLVersionSampleSize19-08-2011.png

  1. 1
  2. Next ›
  3. Last »