Many of you are aware of the Oracle 11g Database New Features and while some may be generally interested in new features, one area that I focus on is new features that yield gains in performance. Some of these features can be found in the General Server Performance section of the Oracle 11g Database New Features documentation. There is one area (for now…) that didn’t make this list but I feel is worth mentioning - performance enhancements made to `DBMS_STATS`

.

# The Necessity of Representative Statistics

Representative statistics are the foundation that the Optimizer relies on to make the best decisions when choosing execution plans. One recent blog post from Don Seiler, with the help of Wolfgang Breitling, is a prefect real-world case. This blog post dealt with out-of-range values, but one other case that often causes headaches is data skew. In the Real-World Performance Roundtable, Part II session at OracleWorld 2006, I discussed a basic stats gathering strategy that dealt with the exception case of data skew. When using the `DBMS_STATS`

default of `DBMS_STATS.AUTO_SAMPLE_SIZE`

in 10g and 9i, the NDV (Number of Distinct Values) may be statistically inaccurate when there is significant data skew. In order to deal with this exception, a fixed percentage of data that yields statistically representative NDV counts should be chosen.

## 11g DBMS_STATS

In 11g there have been some enhancements made to the `DBMS_STATS`

package. Overall the `GATHER_*`

processes run faster but what stands out to me is the speed and accuracy that `DBMS_STATS.AUTO_SAMPLE_SIZE`

now gives. As a performance person, I often times make reference to letting the numbers tell the story, so lets dive into a comparison between 10.2.0.3 and 11.1.0.5. I’ve chosen the same data set that I used in the “Refining the Stats” section of Real-World Performance Roundtable, Part II session. Stats were serially gathered with `ESTIMATE_PERCENT`

of 10%, 100%, and `DBMS_STATS.AUTO_SAMPLE_SIZE`

.

**10.2.0.3**

run# | AUTO_SAMPLE_SIZE | 10% | 100% |
---|---|---|---|

1 | 00:07:53.97 | 00:04:18.87 | 00:09:22.15 |

2 | 00:09:06.09 | 00:04:18.95 | 00:09:13.28 |

3 | 00:07:46.23 | 00:03:52.50 | 00:09:18.11 |

4 | 00:07:55.43 | 00:04:02.94 | 00:09:20.54 |

5 | 00:09:43.30 | 00:03:49.96 | 00:09:16.38 |

**11.1.0.5**

run# | AUTO_SAMPLE_SIZE | 10% | 100% |
---|---|---|---|

1 | 00:02:39.31 | 00:02:38.55 | 00:07:37.83 |

2 | 00:02:21.86 | 00:02:31.56 | 00:08:24.10 |

3 | 00:02:38.11 | 00:02:49.49 | 00:07:38.25 |

4 | 00:02:26.60 | 00:02:27.75 | 00:07:42.25 |

5 | 00:02:29.95 | 00:02:29.45 | 00:07:42.49 |

# 11g DBMS_STATS Observations

As you can see by the numbers, 11g pulls a win in each of the three `GATHER_TABLE_STATS`

calls. Take note of the `AUTO_SAMPLE_SIZE`

timings. The 11g `AUTO_SAMPLE_SIZE`

gather takes the same time as the 11g 10% sample. Not bad!

## NDV Accuracy

We’ve seen that the 11g gather stats is overall faster and that the 11g `AUTO_SAMPLE_SIZE`

shows a significant improvement in speed compared to 10.2.0.3 `AUTO_SAMPLE_SIZE`

for this table, but how do the NDV calculations compare? Again, let’s look at the numbers. I’ve queried `USER_TAB_COL_STATISTICS`

to get the NDV and `SAMPLE_SIZE`

for our skewed data set.

### 10.2.0.3

```
ESTIMATE_PERCENT => 10
COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1 31464 0 2148910
C2 608544 0 2148910
C3 359424 0 2148910
ESTIMATE_PERCENT => 100%
COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1 60351 0 21456269
C2 1289760 0 21456269
C3 777942 0 21456269
ESTIMATE_PERCENT => DBMS_STATS.AUTO_SAMPLE_SIZE
COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1 1787 0 5823
C2 367075 0 576909
C3 52464 0 57431
```

### 11.1.0.5

```
ESTIMATE_PERCENT => 10
COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1 31320 0 2147593
C2 608814 0 2147593
C3 359365 0 2147593
ESTIMATE_PERCENT => 100
COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1 60351 0 21456269
C2 1289760 0 21456269
C3 777942 0 21456269
ESTIMATE_PERCENT => DBMS_STATS.AUTO_SAMPLE_SIZE
COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE
--------------- ------------ ---------- -----------
C1 59852 0 21456269
C2 1270912 0 21456269
C3 768384 0 21456269
```

As expected, the 100% samples are identical and the 10% samples are statistically equivalent. One interesting data point is that the `SAMPLE_SIZE`

for the 11g `AUTO_SAMPLE_SIZE`

run shows the exact `SAMPLE_SIZE`

as the 100% gather - the total number of rows in the table. Also note that the NDV counts for the 11g `AUTO_SAMPLE_SIZE`

gather are statistically equivalent to the 100% sample. What does this mean? It means that the 11g `AUTO_SAMPLE_SIZE`

had been enhanced to provide nearly 100% sample accuracy, even on skewed data sets.

# Summary

Overall the 11g `DBMS_STATS`

has been enhanced to gather stats in less time, but in my opinion the significant enhancement is to `AUTO_SAMPLE_SIZE`

which yields near 100% sample accuracy in 10% sample time. As the documentation says:

…Oracle recommends setting the

`ESTIMATE_PERCENT`

parameter of the`DBMS_STATS`

gathering procedures to`DBMS_STATS.AUTO_SAMPLE_SIZE`

to maximize performance gains while achieving necessary statistical accuracy.

I couldn’t agree with the documentation more. If you wish to know more about how the new `DBMS_STATS.AUTO_SAMPLE_SIZE`

works, see section 3 of *Efficient and scalable statistics gathering for large databases in Oracle 11g*.