Managing Optimizer Statistics Paper

Over the past couple of days I’ve been reading through the recent paper by Karen Morton entitled “Managing Statistics for Optimal Query Performance”. In this paper Karen goes over many of the topics I have discussed as well (and a few that I have not) in the following blog posts:

Overall I think Karen does a good job discussing the key issues related to object statistics, and she has examples to assist in understanding where and why issues can arise. I’m also a fan of works that do not project an answer, but rather provide information, explain how to use that information, work through examples, and ultimately leave the reader with the task of deciding how best to apply the newfound knowledge to their own environment. I would recommend that everyone give it a read.

Comments On Dynamic Sampling

In section 5.1 on page 22 Karen discusses using dynamic sampling. Of particular interest is a table of statistics from Robyn Sands that compares two different runs of some application jobs, both using optimizer_dynamic_sampling=4. One run uses stats collected with a 100% sample; the other uses no static stats at all, relying only on dynamic sampling. The data that Robyn has captured looks interesting, but I’m not exactly sure how to interpret all of it: there are a lot of numbers in that chart and only two sentences that speak to the data. I think I understand the message behind the chart, but it would be easier to see the median along with the average, so that one could clearly tell whether there is a large range of values (numbers far from the median would indicate some vast difference).

This type of comparison (before change/after change) is exactly what the Oracle Database 11g SQL Performance Analyzer (SPA) was designed for. One of the reasons I think SQL Performance Analyzer is a great tool for controlled execution and comparison of SQL statements is that it not only collects all the execution metrics, but also captures the execution plan. Looking at the data from Robyn leaves me with the question: for the SQL statements executed, which ones had a plan change, and was that plan change for the better or for the worse?

The last comment I would make is that while relying 100% on dynamic sampling (having no object statistics) is an interesting data point, I would not recommend it for a production database. Dynamic sampling was designed to augment static statistics and to provide some statistics to the query optimizer in cases where none exist and only guesses (or selectivity constants) would otherwise be used. The main logic behind this recommendation is that dynamic sampling does not provide all the metadata that static statistics provide; it only provides a subset. While this subset of metadata may be enough to get the equivalent plan at times, frequently it is not.
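For reference, dynamic sampling can be enabled at the session level or with a statement-level hint; here is a minimal sketch (the table name is just a placeholder, not anything from Robyn’s tests):

    -- Session level: applies to statements parsed in this session
    ALTER SESSION SET optimizer_dynamic_sampling = 4;

    -- Statement level: the hint overrides the parameter for this query only
    -- (sales_fact is a placeholder table name)
    SELECT /*+ dynamic_sampling(s 4) */ COUNT(*)
      FROM sales_fact s
     WHERE sale_date >= DATE '2009-01-01';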

Remember The Chain of Events

Karen mentions (page 5) something that was in the Oracle OpenWorld 2008 session “Real-World Database Performance Techniques and Methods” (slide 29):

[Slide 29 image]

Having a good understanding of this chain of events should help you in getting to the root cause of poor execution plans. The key takeaway is to recognize that without representative statistics, the optimizer has very little chance to choose the best execution plan, so give it the best chance you can.

Additional Resources

If you have not looked through the slides from “Real-World Database Performance Techniques and Methods” I might suggest you do. There is some great information based on the Real-World Performance Group’s experience that centers around this topic. Especially note the sections:

  • Optimizer Exposé (slide 7)
  • Managing Statistics on Partitioned Tables (slide 34)

Hopefully you find the information in those slides useful.

7 comments

  1. Karen Morton

    Hopefully Robyn will comment more on her variance findings but I thought I’d mention how I came to include her test findings in my paper. While in Denmark together last fall, I mentioned my interest in how dynamic sampling would fare against other collections at various sample sizes in producing execution plans that performed at least as well as, if not better than, plans derived from static collections. She was conducting application tests that would make it fairly easy for her to inject one more test and said she’d be happy to try out dynamic sampling and let me know how it fared. I included just a small bit of her data so I could mention that dynamic sampling did fare quite well in her specific example.

    While I certainly wouldn’t advocate that one set of tests should be used to prove dynamic sampling “should be” used, it did confirm my belief that dynamic sampling can perform admirably if/when it is used. So your point about not recommending the use of dynamic sampling in production in place of actual stats collections is well taken and I agree wholeheartedly.

    Also, thanks for the links to your related posts and to the OpenWorld 2008 paper. I had reviewed and footnoted a paper from OpenWorld 2005 and will now also go back and include a reference to the 2008 one and your posts as well.

  2. robyn sands

    Greg,

    A little background first. I had a test environment set up to measure for consistency in response time. I’m using the ILO package to record the elapsed time for a set of summarization processes, and then calculating some basic statistics (mean, median, mode, standard deviation and variance) on the elapsed time results for multiple executions with different sized data sets. The goal is to identify the combination of factors that result in achieving our performance targets consistently while still in the application design process. I’ve found that high levels of variance in relation to the average response time can indicate performance issues, usually related to suboptimal code or changing execution plans. Every once in a while, it ends up being a bug, but not too often …
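    Just to illustrate (the table and column names here are made up, not our actual schema), those basic statistics fall straight out of aggregate functions over the recorded elapsed times, something like:

        SELECT process_name,
               COUNT(*)                    AS executions,
               AVG(elapsed_seconds)        AS mean_elapsed,
               MEDIAN(elapsed_seconds)     AS median_elapsed,
               STATS_MODE(elapsed_seconds) AS mode_elapsed,
               STDDEV(elapsed_seconds)     AS stddev_elapsed,
               VARIANCE(elapsed_seconds)   AS variance_elapsed
          FROM process_timings             -- hypothetical ILO timing table
         GROUP BY process_name;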

    In the test and production environments for this application, only the summarization processes are instrumented with ILO. In production, I gather 100% statistics on each partition and its indexes one time after the data is first populated. After that, the data for that partition remains unchanged and the execution plans remain consistent within a reasonable level of change in data quantity across partitions.
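    The one-time collection itself is nothing fancy; roughly speaking it is a call like the one below (the schema, table, and partition names are placeholders):

        BEGIN
          DBMS_STATS.GATHER_TABLE_STATS(
            ownname          => 'APP_OWNER',    -- placeholder schema
            tabname          => 'SUMMARY_FACT', -- placeholder table
            partname         => 'P_200901',     -- the newly loaded partition
            estimate_percent => 100,            -- 100% sample
            granularity      => 'PARTITION',    -- this partition only
            cascade          => TRUE);          -- include the partition's indexes
        END;
        /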

    I presented my material at Lalandia in October, and Karen and I chatted about it while waiting for our flight(s) home. Later I executed 4 different test scenarios for her, two of which she quoted in the paper. We were curious about two things:

    1. Would the summarization processes maintain consistent response times using dynamically gathered statistics?

    2. Would the act of dynamic sampling itself increase the variance in the processing time?

    I executed 6 sets of 11 processes and each set contained between about 7 and 12 million records with varying levels of skew in the data. We use timestamps and process the date to the minute, so we’re fairly sensitive to inaccurate statistics, particularly in relation to high and low values. I didn’t check the individual execution plans; at this stage of testing, I’m just replicating the nightly summarization process without interruption.

    Overall, I was pleasantly surprised by the results. There will always be some level of variance in processing, so I’m generally looking for a statistically significant shift in variance to indicate a problem or an improvement, and neither the collection of statistics during the summarization window nor the use of the dynamically gathered statistics created an order-of-magnitude increase in response time variance. I’m not going to change our production application to rely completely on dynamic sampling at this time, but it had a low impact with good results when used for longer running processes (which I believe is consistent with your previous postings).

    If you’d like more info or the full test results, I can send them to you. The results from the other two tests were not significantly different but I still have those as well.

    best regards … Robyn

  3. Greg Rahn

    @Karen Morton
    I think we are on the same page when it comes to dynamic sampling. The reason I wanted to make a specific point about relying only on dynamic sampling is that all too often I have seen data misinterpreted. In your case I wanted to make sure your comments about the dynamic sampling runs having less variance were not read as saying that dynamic sampling with no stats produced results equal to or better than dynamic sampling combined with static stats gathered with a 100% sample, which would raise the question: why gather static stats at all?

  4. Greg Rahn

    @robyn sands
    Thanks for the comments but I have a few questions:
    1) What does ILO stand for?
    2) WRT the numbers in the chart in Karen’s paper, do any of these jobs exhibit a large variance (or relatively large)? What numbers denote this?
    3) How would a process that does have large variance manifest itself in the numbers? (are variance numbers close to zero better, etc?) Or perhaps better put, if someone knew nothing of statistics, what would be your explanation of how to interpret the meaning of those numbers.

  5. robyn sands

    @Greg Rahn
    1. ILO is the Instrumentation Library for Oracle, available at SourceForge.net and maintained by Method-R. I’ve used data from job scheduling systems for this type of analysis as well.

    2 & 3. This is a little challenging to answer in a comment, but I’ll give it a shot:

    Variance is a measure of the consistency of a process’s execution time. The less variance in a process, the more consistent the execution time. So numbers approaching 0 are better and higher numbers are worse. I am making no attempt to state that variance over X is bad, and I really emphasize that point in my presentations.

    The test environment referenced in Karen’s paper was intended to measure changes in response time and response time variance as conditions in the database were altered. Each process is measured against itself.

    These measurements were taken at the end of the testing cycle: variance and elapsed time had already been reduced to acceptable levels. In the first set of tests, one of the processes had an average elapsed time of 4109.6 seconds with variance of 4755461.77 over 6 executions. It was reduced to an average elapsed time of 116.8 seconds with a variance of 1833.248, again, 6 executions with the same data set. This would be an order of magnitude reduction. (actually several orders – other processes improved but not by that much) In that context, the changes in variance due to the use of dynamic sampling were very minimal.

    I have also used variance to target processes within a batch for performance tuning and it can be an effective tool in that context. I rank the jobs by variance to mean ratio (VMR), in descending order, so the highest VMR becomes my first target for investigation. I could use just variance but that hides problem jobs with shorter execution times.

    Using the data in Karen’s paper, job number 1 would be first, followed by jobs 6 and 8. I usually get asked why I don’t use standard deviation, or the comparison of standard deviation to the average, but I found that these numbers lead me to jobs that process predictably different data quantities. VMR tends to be high in jobs that have random performance issues which then suck up system resources needed by the other jobs. I’ve used our data warehouses to test the concept and it’s been pretty effective thus far.
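    As a rough sketch (again with made-up table and column names), the ranking is just:

        -- rank jobs by variance-to-mean ratio; highest VMR is the first tuning target
        SELECT process_name,
               AVG(elapsed_seconds)                             AS mean_elapsed,
               VARIANCE(elapsed_seconds)                        AS variance_elapsed,
               VARIANCE(elapsed_seconds) / AVG(elapsed_seconds) AS vmr
          FROM process_timings
         GROUP BY process_name
         ORDER BY vmr DESC;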

    Hopefully, I haven’t created more questions than I’ve answered but I’ll watch for more comments just in case …

  6. Greg Rahn

    @robyn sands

    robyn sands :
    The less variance in a process, the more consistent the execution time. So numbers approaching 0 are better and higher numbers are worse.

    I think these two sentences explain how to make a quick interpretation of the numbers.

    Thanks again for your comments.

  7. Kerry Osborne

    I have to agree with Greg, the chart is confusing. I think a standard deviation or normalized stddev is much easier to understand. Also, elapsed time will have variance due to a lot of factors: contention for resources from a varying workload, for one thing, and especially how many of the data blocks happen to be cached at the time the query is executed. Finally, it seems to me that large variances are probably an indicator that plans are changing (which may be the whole point of what you’re trying to point out). But it’s pretty easy to see that by simply looking directly at the plan_hash_value in either the shared pool (v$sql) or AWR (dba_hist_sqlstat).
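    For example, something along these lines will list the SQL_IDs that have executed with more than one plan (just a sketch, not a full script):

        -- plans currently in the shared pool
        SELECT sql_id, COUNT(DISTINCT plan_hash_value) AS plan_count
          FROM v$sql
         GROUP BY sql_id
        HAVING COUNT(DISTINCT plan_hash_value) > 1;

        -- plans captured historically by AWR
        SELECT sql_id, COUNT(DISTINCT plan_hash_value) AS plan_count
          FROM dba_hist_sqlstat
         GROUP BY sql_id
        HAVING COUNT(DISTINCT plan_hash_value) > 1;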
