Oracle 11g: Enhancements to DBMS_STATS

Many of you are aware of the Oracle 11g Database New Features and while some may be generally interested in new features, one area that I focus on is new features that yield gains in performance. Some of these features can be found in the General Server Performance section of the Oracle 11g Database New Features documentation. There is one area (for now…) that didn’t make this list but I feel is worth mentioning – performance enhancements made to DBMS_STATS.

The Necessity of Representative Statistics
Representative statistics are the foundation that the Optimizer relies on to make the best decisions when choosing execution plans. One recent blog post from Don Seiler, with the help of Wolfgang Breitling, is a prefect real-world case. This blog post dealt with out-of-range values, but one other case that often causes headaches is data skew. In the Real-World Performance Roundtable, Part II session at OracleWorld 2006, I discussed a basic stats gathering strategy that dealt with the exception case of data skew. When using the DBMS_STATS default of DBMS_STATS.AUTO_SAMPLE_SIZE in 10g and 9i, the NDV (Number of Distinct Values) may be statistically inaccurate when there is significant data skew. In order to deal with this exception, a fixed percentage of data that yields statistically representative NDV counts should be chosen.

11g DBMS_STATS
In 11g there have been some enhancements made to the DBMS_STATS package. Overall the GATHER_* processes run faster but what stands out to me is the speed and accuracy that DBMS_STATS.AUTO_SAMPLE_SIZE now gives. As a performance person, I often times make reference to letting the numbers tell the story, so lets dive into a comparison between 10.2.0.3 and 11.1.0.5.

I’ve chosen the same data set that I used in the “Refining the Stats” section of Real-World Performance Roundtable, Part II session. Stats were serially gathered with ESTIMATE_PERCENT of 10%, 100%, and DBMS_STATS.AUTO_SAMPLE_SIZE.

10.2.0.3
run#
AUTO_SAMPLE_SIZE
10%
100%
1
00:07:53.97
00:04:18.87
00:09:22.15
2
00:09:06.09
00:04:18.95
00:09:13.28
3
00:07:46.23
00:03:52.50
00:09:18.11
4
00:07:55.43
00:04:02.94
00:09:20.54
5
00:09:43.30
00:03:49.96
00:09:16.38

11.1.0.5
run#
AUTO_SAMPLE_SIZE
10%
100%
1
00:02:39.31
00:02:38.55
00:07:37.83
2
00:02:21.86
00:02:31.56
00:08:24.10
3
00:02:38.11
00:02:49.49
00:07:38.25
4
00:02:26.60
00:02:27.75
00:07:42.25
5
00:02:29.95
00:02:29.45
00:07:42.49


11g DBMS_STATS Observations
As you can see by the numbers, 11g pulls a win in each of the three GATHER_TABLE_STATS calls. Take note of the AUTO_SAMPLE_SIZE timings. The 11g AUTO_SAMPLE_SIZE gather takes the same time as the 11g 10% sample. Not bad!

NDV Accuracy
We’ve seen that the 11g gather stats is overall faster and that the 11g AUTO_SAMPLE_SIZE shows a significant improvement in speed compared to 10.2.0.3 AUTO_SAMPLE_SIZE for this table, but how do the NDV calculations compare? Again, let’s look at the numbers. I’ve queried USER_TAB_COL_STATISTICS to get the NDV and SAMPLE_SIZE for our skewed data set.

10.2.0.3

ESTIMATE_PERCENT => 10
COLUMN_NAME     NUM_DISTINCT  NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1                     31464          0     2148910
C2                    608544          0     2148910
C3                    359424          0     2148910

ESTIMATE_PERCENT => 100%
COLUMN_NAME     NUM_DISTINCT  NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1                     60351          0    21456269
C2                   1289760          0    21456269
C3                    777942          0    21456269

ESTIMATE_PERCENT => DBMS_STATS.AUTO_SAMPLE_SIZE
COLUMN_NAME     NUM_DISTINCT  NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1                      1787          0        5823
C2                    367075          0      576909
C3                     52464          0       57431

11.1.0.5

ESTIMATE_PERCENT => 10
COLUMN_NAME     NUM_DISTINCT  NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1                     31320          0     2147593
C2                    608814          0     2147593
C3                    359365          0     2147593

ESTIMATE_PERCENT => 100
COLUMN_NAME     NUM_DISTINCT  NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1                     60351          0    21456269
C2                   1289760          0    21456269
C3                    777942          0    21456269

ESTIMATE_PERCENT => DBMS_STATS.AUTO_SAMPLE_SIZE
COLUMN_NAME     NUM_DISTINCT  NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1                     59852          0    21456269
C2                   1270912          0    21456269
C3                    768384          0    21456269

As expected, the 100% samples are identical and the 10% samples are statistically equivalent. One interesting data point is that the SAMPLE_SIZE for the 11g AUTO_SAMPLE_SIZE run shows the exact SAMPLE_SIZE as the 100% gather – the total number of rows in the table. Also note that the NDV counts for the 11g AUTO_SAMPLE_SIZE gather are statistically equivalent to the 100% sample. What does this mean? It means that the 11g AUTO_SAMPLE_SIZE had been enhanced to provide nearly 100% sample accuracy, even on skewed data sets.

Summary
Overall the 11g DBMS_STATS has been enhanced to gather stats in less time, but in my opinion the significant enhancement is to AUTO_SAMPLE_SIZE which yields near 100% sample accuracy in 10% sample time.

As the documentation says:

…Oracle recommends setting the ESTIMATE_PERCENT parameter of the DBMS_STATS gathering procedures to DBMS_STATS.AUTO_SAMPLE_SIZE to maximize performance gains while achieving necessary statistical accuracy.

I couldn’t agree with the documentation more.

If you wish to know more about how the new DBMS_STATS.AUTO_SAMPLE_SIZE works, see section 3 of Efficient and scalable statistics gathering for large databases in Oracle 11g.

21 comments

  1. Alan Patil

    Conversely are you saying that AUTO_SAMPLE_SIZE shouldn’t be used on 10g as the stats dont look very good in terms of num_distinct (I have noticed this in some of our 10g databases using the default stats collection job).

    Alan

  2. Greg Rahn

    I am not saying don’t use AUTO_SAMPLE_SIZE in 10g (and 9i), but you may need to adjust your stats gathering strategy if you 1) have skewed data and 2) AUTO_SAMPLE_SIZE is not yielding representative NDV. See the presentations from Oracle OpenWorld 2006. It discusses a high level strategy for stats gathering. The basic approach is: Start with AUTO_SAMPLE_SIZE and adjust as necessary.

  3. Pingback: Oracle 11g: Extended Statistics | Structured Data
  4. Pingback: Troubleshooting Bad Execution Plans | Structured Data
  5. Alan Patil

    Thanks for your reply. However I feel that with the AUTO_SAMPLE_SIZE parameter its kind of an all or nothing approach, either Oracle does the stats using a sample size the Oracle developers choose or you do your own stats collection process. It would be nice if in the next version of Oracle they had an additional parameter/s so that you could tune AUTO_SAMPLE_SIZE to your own circumstance, i.e. for a 24×7 system under heavy load you might want smaller sample sizes as you might want the stats job to run very quickly. Alternatively you might have a system which was completely offline at the weekend so you could afford for the stats job to run all weekend. The extra parameter could be say a value between 1 and a 100 with a 100 representing 100% sample for everything down to 1 suggesting a much smaller sample percentage for big tables. You could then tune this value to the amount of time you have available. You could also have a more sophisticated version where you specify your analyze window in terms of time and then AUTO_SAMPLE_SIZE alters its sample size depending on the number of tables, their size etc taking into account that it should complete within the time specified.

    Alan

  6. Greg Rahn

    Alan-

    I’m uncertain why you feel DBMS_STATS.AUTO_SAMPLE_SIZE is an all or nothing approach. Personally I think it is the only approach for 11g. In 11g, it samples all of the data very fast and very accurate. DBMS_STATS has a parameter named ESTIMATE_PRECENT and it can have any value from 0.000001 to 100. It can also take the value of DBMS_STATS.AUTO_SAMPLE_SIZE. The objective of DBMS_STATS.AUTO_SAMPLE_SIZE is to allow DBMS_STATS to find a sample size that yields accurate NDV utilizing the smallest sample possible. If you know your data well enough and want to choose a sample size, feel free. Understand one thing though, it is quite likely that your data will change over time and the sample size that yields accurate NDV today, may not be the size that works down the road. This is the advantage that DBMS_STATS.AUTO_SAMPLE_SIZE has. It should adjust as your data does without manual intervention.

  7. Alan Patil

    Thanks for your reply Greg. What I meant by an all or nothing approach is I would like to use dbms_stats.gather_database_stats() with auto_sample_size for ALL objects. I dont really want to write my own procedure to call dbms_stats for individual objects due to their own particular peculiarities (as I have done in the past). The only problem I have with auto_sample_size is how do the Oracle developers know 1) how long I can afford to have dbms_stats running for 2) how does it know how to generate accurate stats in ALL cases.

    That is why I suggested 1) giving dbms_stats a time limit and or having a second parameter to act as a bias so you can tweak the auto_sample_size algorithm to suit your system. Thinking about the problem a bit more it would be nice if you could have a ALTER TABLE xxx stats() command to give dbms_stats some hints as to how the table needs to be analyzed i.e. histograms are required, use a minimum sample percentage, stale table percentage, always sample these two columns together as there is a correlation between them etc.

  8. Greg Rahn

    Alan-

    I completely understand you don’t want a custom stats gathering procedure, but the reality of it is until 11g, DBMS_STATS.AUTO_SAMPLE_SIZE may not yield accurate NDV on skewed data. Just like almost everything else, there are cases that need to be dealt with on an exception basis. I’m quite certain people don’t want to use hints in SQL, but again, the reality is it’s needed sometimes.

  9. Alan Patil

    Hi Greg, my point is that I dont really believe there will ever be 100% accurate stats for every possible situation. So what I am suggesting is a way of tweaking the stats collection methodology without writing code. So the intelligence which I currently put into a procedure should really be put into metadata that the DBA can populate i.e. using an alter table command might be one way. Thus when I come across a case where the default stats collection method doesnt work I just tweak some metadata, I dont scrap my stats collection method and write a new one from scratch. Maybe one for 12x :)

  10. Greg Rahn

    Alan-

    Stats collection doesn’t need to be 100% accurate, it just needs to be statistically accurate, such that the optimal plans are chosen. The DBMS_STATS.AUTO_SAMPLE_SIZE in 11g does just this. 11g also allows you to set the stale percent (see DBMS_STATS) as well as use private stats so that one could validate a plan before publishing it. If these new features don’t cover your issues I’d be interested in knowing more.

  11. Pingback: Choosing An Optimal Stats Gathering Strategy | Structured Data
  12. Saul

    Greg,

    First off, good article and I agree (and practice) with the approach spelled out. One thing, I would prefer to find skewed data proactively rather than reactively (after a SQL stmt goes bad). Since my environment is still on 10g, I thought about developing a script that calculates Pearson’s skewness formula (MEAN – MODE)/STDDEV on columns identified in SYS.COL_USAGE$. Not sure that is the best approach or whether I would be reinventing the wheel, so wanted to get your thoughts.

  13. Greg Rahn

    Saul-

    I understand trying to be proactive versus reactive, but I think it may be a case of “you don’t know what you don’t know”, until you know.

    You can try proactively searching for it, but you may find that a good BI person or data architect could tell you where such scenarios may occur and then just do a light investigation. People that work with the data each and every day have a pretty good understanding of it, I’ve found.

  14. Pingback: Oracle 11g: Incremental Global Statistics On Partitioned Tables | Structured Data
  15. Pingback: Oracle 11g, Istogrammi e DBMS_STATS.AUTO_SAMPLE_SIZE « Oracle and other
  16. Pingback: Kerry Osborne’s Oracle Blog » Blog Archive I’ll Gladly Pay You Tuesday for a Hamburger Today - Kerry Osborne’s Oracle Blog
  17. Eric

    Hi Greg,

    Do you by any chance still have a copy of the paper – “Efficient and scalable statistics gathering for large databases in Oracle 11g”? I try to read it through.

    Thanks

  18. Pingback: Ошибочная оценка NDV столбцов с параметром DBMS_STATS.AUTO_SAMPLE_SIZE в версии 10.2 « Oracle mechanics
  19. Pingback: Old Habits | Oracle Database 101
  20. peterwiseman

    I’ve had an interesting discovery today that raises a question over my initial confirmation that 11g takes 10% of the time and high quality statistics. I naively expected a gather to take 10% of the time of a full table scan – it didn’t. So I turned on tracing. The first query is performing a full tablescan with no sampling to obtain column counts and min/max values. Subsequent queries for index stats then use sampling so are a lot quicker. I don’t have histograms enabled, and I expect any subsequent queries to gather histogram values would also be sampled. So overall, AUTO is quicker than 100% sampling, though only in the case where you are gathering index and/or histogram statistics.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s