Oracle 11g: Enhancements to DBMS_STATS
Many of you are aware of the Oracle 11g Database New Features and while some may be generally interested in new features, one area that I focus on is new features that yield gains in performance. Some of these features can be found in the General Server Performance section of the Oracle 11g Database New Features documentation. There is one area (for now…) that didn’t make this list but I feel is worth mentioning – performance enhancements made to DBMS_STATS.
The Necessity of Representative Statistics
Representative statistics are the foundation that the Optimizer relies on to make the best decisions when choosing execution plans. One recent blog post from Don Seiler, with the help of Wolfgang Breitling, is a prefect realworld case. This blog post dealt with outofrange values, but one other case that often causes headaches is data skew. In the RealWorld Performance Roundtable, Part II session at OracleWorld 2006, I discussed a basic stats gathering strategy that dealt with the exception case of data skew. When using the DBMS_STATS default of DBMS_STATS.AUTO_SAMPLE_SIZE in 10g and 9i, the NDV (Number of Distinct Values) may be statistically inaccurate when there is significant data skew. In order to deal with this exception, a fixed percentage of data that yields statistically representative NDV counts should be chosen.
11g DBMS_STATS
In 11g there have been some enhancements made to the DBMS_STATS package. Overall the GATHER_* processes run faster but what stands out to me is the speed and accuracy that DBMS_STATS.AUTO_SAMPLE_SIZE now gives. As a performance person, I often times make reference to letting the numbers tell the story, so lets dive into a comparison between 10.2.0.3 and 11.1.0.5.
I’ve chosen the same data set that I used in the “Refining the Stats” section of RealWorld Performance Roundtable, Part II session. Stats were serially gathered with ESTIMATE_PERCENT of 10%, 100%, and DBMS_STATS.AUTO_SAMPLE_SIZE.
10.2.0.3  
run#

AUTO_SAMPLE_SIZE

10%

100%

1

00:07:53.97

00:04:18.87

00:09:22.15

2

00:09:06.09

00:04:18.95

00:09:13.28

3

00:07:46.23

00:03:52.50

00:09:18.11

4

00:07:55.43

00:04:02.94

00:09:20.54

5

00:09:43.30

00:03:49.96

00:09:16.38

11.1.0.5  
run#

AUTO_SAMPLE_SIZE

10%

100%

1

00:02:39.31

00:02:38.55

00:07:37.83

2

00:02:21.86

00:02:31.56

00:08:24.10

3

00:02:38.11

00:02:49.49

00:07:38.25

4

00:02:26.60

00:02:27.75

00:07:42.25

5

00:02:29.95

00:02:29.45

00:07:42.49

11g DBMS_STATS Observations
As you can see by the numbers, 11g pulls a win in each of the three GATHER_TABLE_STATS calls. Take note of the AUTO_SAMPLE_SIZE timings. The 11g AUTO_SAMPLE_SIZE gather takes the same time as the 11g 10% sample. Not bad!
NDV Accuracy
We’ve seen that the 11g gather stats is overall faster and that the 11g AUTO_SAMPLE_SIZE shows a significant improvement in speed compared to 10.2.0.3 AUTO_SAMPLE_SIZE for this table, but how do the NDV calculations compare? Again, let’s look at the numbers. I’ve queried USER_TAB_COL_STATISTICS to get the NDV and SAMPLE_SIZE for our skewed data set.
10.2.0.3
ESTIMATE_PERCENT => 10 COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE     C1 31464 0 2148910 C2 608544 0 2148910 C3 359424 0 2148910 ESTIMATE_PERCENT => 100% COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE     C1 60351 0 21456269 C2 1289760 0 21456269 C3 777942 0 21456269 ESTIMATE_PERCENT => DBMS_STATS.AUTO_SAMPLE_SIZE COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE     C1 1787 0 5823 C2 367075 0 576909 C3 52464 0 57431
11.1.0.5
ESTIMATE_PERCENT => 10 COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE     C1 31320 0 2147593 C2 608814 0 2147593 C3 359365 0 2147593 ESTIMATE_PERCENT => 100 COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE     C1 60351 0 21456269 C2 1289760 0 21456269 C3 777942 0 21456269 ESTIMATE_PERCENT => DBMS_STATS.AUTO_SAMPLE_SIZE COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE     C1 59852 0 21456269 C2 1270912 0 21456269 C3 768384 0 21456269
As expected, the 100% samples are identical and the 10% samples are statistically equivalent. One interesting data point is that the SAMPLE_SIZE for the 11g AUTO_SAMPLE_SIZE run shows the exact SAMPLE_SIZE as the 100% gather – the total number of rows in the table. Also note that the NDV counts for the 11g AUTO_SAMPLE_SIZE gather are statistically equivalent to the 100% sample. What does this mean? It means that the 11g AUTO_SAMPLE_SIZE had been enhanced to provide nearly 100% sample accuracy, even on skewed data sets.
Summary
Overall the 11g DBMS_STATS has been enhanced to gather stats in less time, but in my opinion the significant enhancement is to AUTO_SAMPLE_SIZE which yields near 100% sample accuracy in 10% sample time.
As the documentation says:
…Oracle recommends setting the ESTIMATE_PERCENT parameter of the DBMS_STATS gathering procedures to DBMS_STATS.AUTO_SAMPLE_SIZE to maximize performance gains while achieving necessary statistical accuracy.
I couldn’t agree with the documentation more.
If you wish to know more about how the new DBMS_STATS.AUTO_SAMPLE_SIZE works, see section 3 of Efficient and scalable statistics gathering for large databases in Oracle 11g.
Conversely are you saying that AUTO_SAMPLE_SIZE shouldn’t be used on 10g as the stats dont look very good in terms of num_distinct (I have noticed this in some of our 10g databases using the default stats collection job).
Alan
I am not saying don’t use AUTO_SAMPLE_SIZE in 10g (and 9i), but you may need to adjust your stats gathering strategy if you 1) have skewed data and 2) AUTO_SAMPLE_SIZE is not yielding representative NDV. See the presentations from Oracle OpenWorld 2006. It discusses a high level strategy for stats gathering. The basic approach is: Start with AUTO_SAMPLE_SIZE and adjust as necessary.
Thanks for your reply. However I feel that with the AUTO_SAMPLE_SIZE parameter its kind of an all or nothing approach, either Oracle does the stats using a sample size the Oracle developers choose or you do your own stats collection process. It would be nice if in the next version of Oracle they had an additional parameter/s so that you could tune AUTO_SAMPLE_SIZE to your own circumstance, i.e. for a 24×7 system under heavy load you might want smaller sample sizes as you might want the stats job to run very quickly. Alternatively you might have a system which was completely offline at the weekend so you could afford for the stats job to run all weekend. The extra parameter could be say a value between 1 and a 100 with a 100 representing 100% sample for everything down to 1 suggesting a much smaller sample percentage for big tables. You could then tune this value to the amount of time you have available. You could also have a more sophisticated version where you specify your analyze window in terms of time and then AUTO_SAMPLE_SIZE alters its sample size depending on the number of tables, their size etc taking into account that it should complete within the time specified.
Alan
Alan
I’m uncertain why you feel DBMS_STATS.AUTO_SAMPLE_SIZE is an all or nothing approach. Personally I think it is the only approach for 11g. In 11g, it samples all of the data very fast and very accurate. DBMS_STATS has a parameter named ESTIMATE_PRECENT and it can have any value from 0.000001 to 100. It can also take the value of DBMS_STATS.AUTO_SAMPLE_SIZE. The objective of DBMS_STATS.AUTO_SAMPLE_SIZE is to allow DBMS_STATS to find a sample size that yields accurate NDV utilizing the smallest sample possible. If you know your data well enough and want to choose a sample size, feel free. Understand one thing though, it is quite likely that your data will change over time and the sample size that yields accurate NDV today, may not be the size that works down the road. This is the advantage that DBMS_STATS.AUTO_SAMPLE_SIZE has. It should adjust as your data does without manual intervention.
Thanks for your reply Greg. What I meant by an all or nothing approach is I would like to use dbms_stats.gather_database_stats() with auto_sample_size for ALL objects. I dont really want to write my own procedure to call dbms_stats for individual objects due to their own particular peculiarities (as I have done in the past). The only problem I have with auto_sample_size is how do the Oracle developers know 1) how long I can afford to have dbms_stats running for 2) how does it know how to generate accurate stats in ALL cases.
That is why I suggested 1) giving dbms_stats a time limit and or having a second parameter to act as a bias so you can tweak the auto_sample_size algorithm to suit your system. Thinking about the problem a bit more it would be nice if you could have a ALTER TABLE xxx stats() command to give dbms_stats some hints as to how the table needs to be analyzed i.e. histograms are required, use a minimum sample percentage, stale table percentage, always sample these two columns together as there is a correlation between them etc.
Alan
I completely understand you don’t want a custom stats gathering procedure, but the reality of it is until 11g, DBMS_STATS.AUTO_SAMPLE_SIZE may not yield accurate NDV on skewed data. Just like almost everything else, there are cases that need to be dealt with on an exception basis. I’m quite certain people don’t want to use hints in SQL, but again, the reality is it’s needed sometimes.
Hi Greg, my point is that I dont really believe there will ever be 100% accurate stats for every possible situation. So what I am suggesting is a way of tweaking the stats collection methodology without writing code. So the intelligence which I currently put into a procedure should really be put into metadata that the DBA can populate i.e. using an alter table command might be one way. Thus when I come across a case where the default stats collection method doesnt work I just tweak some metadata, I dont scrap my stats collection method and write a new one from scratch. Maybe one for 12x :)
Alan
Stats collection doesn’t need to be 100% accurate, it just needs to be statistically accurate, such that the optimal plans are chosen. The DBMS_STATS.AUTO_SAMPLE_SIZE in 11g does just this. 11g also allows you to set the stale percent (see DBMS_STATS) as well as use private stats so that one could validate a plan before publishing it. If these new features don’t cover your issues I’d be interested in knowing more.
Greg,
First off, good article and I agree (and practice) with the approach spelled out. One thing, I would prefer to find skewed data proactively rather than reactively (after a SQL stmt goes bad). Since my environment is still on 10g, I thought about developing a script that calculates Pearson’s skewness formula (MEAN – MODE)/STDDEV on columns identified in SYS.COL_USAGE$. Not sure that is the best approach or whether I would be reinventing the wheel, so wanted to get your thoughts.
Saul
I understand trying to be proactive versus reactive, but I think it may be a case of “you don’t know what you don’t know”, until you know.
You can try proactively searching for it, but you may find that a good BI person or data architect could tell you where such scenarios may occur and then just do a light investigation. People that work with the data each and every day have a pretty good understanding of it, I’ve found.
Hi Greg,
Do you by any chance still have a copy of the paper – “Efficient and scalable statistics gathering for large databases in Oracle 11g”? I try to read it through.
Thanks
@Eric
http://tinyurl.com/3anfcqq
I’ve had an interesting discovery today that raises a question over my initial confirmation that 11g takes 10% of the time and high quality statistics. I naively expected a gather to take 10% of the time of a full table scan – it didn’t. So I turned on tracing. The first query is performing a full tablescan with no sampling to obtain column counts and min/max values. Subsequent queries for index stats then use sampling so are a lot quicker. I don’t have histograms enabled, and I expect any subsequent queries to gather histogram values would also be sampled. So overall, AUTO is quicker than 100% sampling, though only in the case where you are gathering index and/or histogram statistics.