
Choosing An Optimal Stats Gathering Strategy

Recently the Oracle Optimizer Development Team put out a White Paper entitled Upgrading from Oracle Database 9i to 10g: What to expect from the Optimizer. This paper discusses the main differences between 9i and 10g in the subject area of the Optimizer and Statistics. As G.I. Joe said, “Now we know! And knowing is half the battle.” The other half of the battle is successfully applying that knowledge to the databases that you manage. Statistics are the input to the Oracle Optimizer and the foundation of good plans. If the statistics supplied to the Oracle Optimizer are non-representative, we can probably expect GIGO (Garbage In, Garbage Out). On the other hand, if the statistics are representative, chances are quite good that the Oracle Optimizer will choose the optimal plan. In this post I’d like to discuss my thoughts on how to choose an optimal stats gathering strategy.

Suggested Readings

If you haven’t done so already, I would first suggest reading the following:

Start With A Clean Slate

My first recommendation is to unset any Optimizer related parameters that exist in your init.ora, unless you have specific recommendations from the application vendor. This includes (but is not limited to):

  • optimizer_index_caching
  • optimizer_index_cost_adj
  • optimizer_mode
  • db_file_multiblock_read_count

In almost every case, the defaults for these parameters are more than acceptable.
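If these parameters live in an spfile, one way to get back to the defaults is to reset them there. A quick sketch (the parameter named below is only an example; in 10g the RESET clause requires SID='*' and SCOPE=SPFILE, and the change takes effect at the next restart):

ALTER SYSTEM RESET optimizer_index_cost_adj SCOPE=SPFILE SID='*';
-- repeat for each non-default Optimizer parameter, then bounce the instance
-- so the defaults take effect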

The same goes for any events and undocumented/hidden/underscore parameters that are set. Hidden parameters and events should only be used to temporarily work around bugs under the guidance and direction of Oracle Support and Development. Contrary to what you may find on the Internet via your favorite search engine, hidden parameters are not meant to be tuning mechanisms and are not a source of magic performance gains. They are mechanisms that developers have instrumented into their code to debug problems and only those developers know and understand the full impact of changing hidden parameters.

High Level Strategy

  • Start With the Defaults: In most cases, the defaults for Optimizer parameters and DBMS_STATS are adequate. If you are upgrading from 9i to 10g, do your homework and note the differences in the defaults. Test them to see if they work well for your data and your execution plans.
  • Dealing With Suboptimal Execution Plans: There may be cases of query plan regression. It is very important to be diligent about finding the root cause. Oftentimes many plan regressions surface from the same root cause. This means that if you can correctly diagnose and resolve the root cause, you have the potential to resolve many plan regressions.
  • Adjust Strategy to Cope with the Exceptions: Once it is understood why the suboptimal plan was chosen, a resolution can be tested and implemented.

Start With The Defaults

I cannot stress enough how important it is to start with the defaults for Optimizer and DBMS_STATS parameters. The Real-World Performance Group has dealt with numerous cases where customers are not using the default values for Optimizer parameters, and simply setting them back to the defaults improves performance. If you are planning to regression test your system because you are upgrading your database, there is no better time to reset these parameters. It will also put you in a much better place to troubleshoot if you run into issues. Give the software the best chance to do what it can for you. Don’t try to be too clever for your own good.

One of the most common problems I’ve seen is that customers have chosen a magic fixed percentage for ESTIMATE_PERCENT in DBMS_STATS. Why do I call it magic? Because the customers had no systematic reasoning for the value they chose.

In the Real-World Performance Group Roundtable session at Oracle OpenWorld 2007 the topic of DBMS_STATS came up and I asked “How many people are using ESTIMATE_PERCENT => DBMS_STATS.AUTO_SAMPLE_SIZE?” A handful of people raised their hands. I then asked, “For those of you who are using a fixed value for ESTIMATE_PERCENT, who can explain how they chose their number, other than just picking it out of the air?” Not one person raised their hand. Scary! The moral of the story: You should have a documented reason (and test case) to deviate from the defaults.
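For reference, a gather call that sticks with the default sample size looks something like this (the table name is purely illustrative; in 10g you could simply omit ESTIMATE_PERCENT, since AUTO_SAMPLE_SIZE is the default):

BEGIN
 DBMS_STATS.GATHER_TABLE_STATS
 (
  OWNNAME => USER
 ,TABNAME => 'ORDERS'  -- illustrative table name
 ,ESTIMATE_PERCENT => DBMS_STATS.AUTO_SAMPLE_SIZE
 );
END;
/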

Dealing With Suboptimal Execution Plans

Probably the most common root cause (I’d say >90%) of suboptimal execution plans is poor cardinality estimates by the Optimizer. Poor cardinality estimates are generally the result of non-representative statistics. This means that the root cause of most Optimizer-related issues is actually stats related, reaffirming how important it is to have representative stats. For more details on troubleshooting poor cardinality estimates, I would suggest reading my post on Troubleshooting Bad Execution Plans.

In 10g, Automatic SQL Tuning was introduced via Enterprise Manager (which uses the package DBMS_SQLTUNE). I would highly recommend that you evaluate this tool (if you have it licensed). I’ve found that it can often come up with quite good suggestions and fixes.

Another option that is available to help get more accurate cardinality estimates is dynamic sampling. This is probably an underused option which can help with getting more accurate cardinality estimates when there is data correlation, etc. Dynamic sampling is most appropriate for DSS and data warehouse databases, where queries run for minutes, not seconds. See my post There Is No Time Like ‘%NOW%’ To Use Dynamic Sampling for a real-world example.
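As a quick, hedged illustration (the table and predicates below are made up), dynamic sampling can be enabled for a single query via a hint; level 4 and above also sample tables that have single-table predicates on two or more columns, which is exactly where correlation hurts the most:

SELECT /*+ dynamic_sampling(s 4) */ COUNT(*)
FROM   sales s                  -- illustrative table
WHERE  cust_state = 'CA'        -- two correlated columns in the
AND    cust_zip LIKE '9%';      -- same single-table predicate

The same behavior can be enabled instance-wide with the OPTIMIZER_DYNAMIC_SAMPLING parameter, but the hint keeps the scope to the queries that need it.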

Adjust Strategy To Cope With The Exceptions

There are three scenarios that seem to be rather common:

  1. Non-representative NDV with skewed data
  2. Out-of-range predicates and partitioned tables
  3. Bind peeking when histograms exist

Non-Representative NDV With Skewed Data

There are cases where the defaults for DBMS_STATS (ESTIMATE_PERCENT => DBMS_STATS.AUTO_SAMPLE_SIZE) may not yield representative NDV in 9i and 10g. One specific case that I’ve seen repeatedly is when there is a large number of distinct values and significant data skew. In this case a fixed sample size that yields representative NDV should be chosen. For a more in-depth review of this, see my post on the new 11g DBMS_STATS.AUTO_SAMPLE_SIZE; it goes through an example of the 10g AUTO_SAMPLE_SIZE NDV/skew issue.
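If you do end up choosing a fixed sample, a sketch might look like the following (the table name and the 20% figure are purely illustrative; the point is that the value is arrived at by testing what yields a representative NDV, not by guessing):

BEGIN
 DBMS_STATS.GATHER_TABLE_STATS
 (
  OWNNAME => USER
 ,TABNAME => 'SALES'      -- illustrative table with skewed, high-NDV columns
 ,ESTIMATE_PERCENT => 20  -- fixed sample chosen by testing, not picked out of the air
 );
END;
/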

While having accurate NDV statistics is desirable, do not become obsessed with having perfect NDV statistics. The goal is to have the Optimizer choose the desired plan, not to have perfect NDV. Having more accurate NDV may not change the plan. This is a case where less than perfect may be good enough. Don’t lose focus of the goal.

Out-of-range Predicates And Partitioned Tables

Another case that frequently comes up is usually related to out-of-range predicates on partitioned tables that do not have representative stats. Don Seiler’s write-up of his real-world case is a poster child for this exception. If you are bulk loading data into a partitioned table, it is critical that any statistics that exist are representative. This problem generally surfaces when statistics have been collected on an empty partition (so all stats are zeros) and the partition has since been bulk loaded. There are a few options here:

  • Stats are gathered immediately after the data is loaded directly into the target table.
  • Data is loaded into a staging table, stats are gathered, and the staging table is partition exchanged into the target table.
  • Stats are cloned or copied from a similar partition (see DBMS_STATS.COPY_TABLE_STATS and the sketch after this list).
  • There are no statistics and dynamic sampling will kick in (assuming it is set to the default of 2 or higher).
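As a sketch of the copy option, DBMS_STATS.COPY_TABLE_STATS (an 11g procedure that, as far as I know, was also backported to 10.2.0.4) copies stats from one partition to another; the owner, table, and partition names below are made up:

BEGIN
 DBMS_STATS.COPY_TABLE_STATS
 (
  OWNNAME => 'SH'                  -- illustrative owner
 ,TABNAME => 'SALES'               -- illustrative partitioned table
 ,SRCPARTNAME => 'SALES_2007_11'   -- a representative, already-analyzed partition
 ,DSTPARTNAME => 'SALES_2007_12'   -- the freshly loaded partition
 );
END;
/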

From what I have seen, this case generally shows up when the query plan has partition elimination down to a single partition. This is because when only one partition is accessed, only the partition stats are used, but when more than one partition is accessed, both the global/table stats and the partition stats are used.

Bind Peeking When Histograms Exist

The combination of these two seems to be a frequent cause of issues in 10g with OLTP systems. In my opinion, these two features should not be used together (or should be used only with complete understanding and extreme caution). My reasoning for this is that when histograms exist, execution plans can vary based on the filter predicates. Well-designed OLTP systems use bind variables for execution plan reuse. On one hand plans can change, and on the other hand plans will be reused. This seems like a complete conflict of interest, hence my position of one or the other. Oh, and let’s not overlook the fact that if you have a RAC database it’s possible to have a different plan on each instance, depending on which value is peeked first. Talk about a troubleshooting nightmare.

Unlike many, I think that disabling bind peeking is not the right answer. This does not address the root cause; it attempts to curb the symptom. If you are running an OLTP system and you are using the nightly GATHER_STATS_JOB, be mindful that it uses its own set of parameters: it overrides most of the parameters. The doc says:

When GATHER AUTO is specified, the only additional valid parameters are stattab, statid, objlist and statown; all other parameter settings are ignored.

It may be best to change the default value of METHOD_OPT via DBMS_STATS.SET_PARAM to 'FOR ALL COLUMNS SIZE REPEAT' and gather stats with your own job. Why REPEAT and not SIZE 1? You may find that a histogram is needed somewhere, and using SIZE 1 will remove it the next time stats are gathered. Of course, the other option is to specify the value for METHOD_OPT in your gather stats script.
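A sketch of both approaches (the SET_PARAM call changes the DBMS_STATS default for every subsequent gather; the table name in the second example is illustrative):

-- change the DBMS_STATS default so future gathers only re-create existing histograms
EXEC DBMS_STATS.SET_PARAM('METHOD_OPT', 'FOR ALL COLUMNS SIZE REPEAT');

-- or specify METHOD_OPT explicitly in your own gather stats job
BEGIN
 DBMS_STATS.GATHER_TABLE_STATS
 (
  OWNNAME => USER
 ,TABNAME => 'ORDERS'  -- illustrative table name
 ,METHOD_OPT => 'FOR ALL COLUMNS SIZE REPEAT'
 );
END;
/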

Common Pitfalls And Problems

  • Disabling The 10g GATHER_STATS_JOB: As many of you know, the GATHER_STATS_JOB was introduced in 10g. Since many customers have custom stats gathering scripts in place, many have chosen to disable this job. Disabling the GATHER_STATS_JOB entirely is not recommended because it also gathers dictionary stats (SYS/SYSTEM schemas). If you wish to collect your statistics manually, then you should change the value of AUTOSTATS_TARGET to ORACLE instead of AUTO (DBMS_STATS.SET_PARAM('AUTOSTATS_TARGET','ORACLE')). This will keep the dictionary stats up to date and allow you to manually gather stats on your schemas as you did in 9i.
  • Representative Statistics: When troubleshooting bad execution plans it is important to evaluate whether the statistics are representative. Many times customers respond with “I just gathered statistics” or “The statistics are recent”. Recently gathered statistics do not equate to representative statistics. Albert Einstein once said, “The definition of insanity is doing the same thing over and over again and expecting different results”. It applies here as well.
  • Too Much Time Spent On Stats Gathering: Often when customers say their stats gathering is taking too long, I ask to see their DBMS_STATS script. Generally there are three reasons that stats gathering takes too long:
    1. Stats are gathered with too fine a setting for GRANULARITY: It is usually unnecessary to gather subpartition stats for composite partitioned tables. Don’t spend time collecting stats that are not needed. Don’t override the default for GRANULARITY unless you have a reason: the default is probably sufficient.
    2. Stats are gathered with unnecessarily large ESTIMATE_PERCENT: Use DBMS_STATS.AUTO_SAMPLE_SIZE to start with and adjust if necessary. No need to sample 10% when 1% or less yields representative statistics. Give DBMS_STATS.AUTO_SAMPLE_SIZE the chance to choose the best sample size.
    3. Stats are gathered more frequently than required or on data that hasn’t changed: The GATHER_STATS_JOB uses OPTIONS => 'GATHER AUTO', so it only gathers statistics on objects with more than a 10% change, using a predefined set of options. If you are gathering statistics on tables/partitions that haven’t changed, or haven’t changed significantly, you may be spending time gathering unnecessary statistics. For example, there is no need to gather partition stats on last month’s partition (or older ones) if the data in those partitions is no longer volatile.
  • Not Understanding The Data: The answers are almost always in the data…skew, correlation, etc. Many operational DBAs don’t have an in-depth understanding of the data they are managing. If this is the case, grab an engineer or analyst who is familiar with the data and work together. Two smart people working on a problem are almost always better than one!
  • Do Not Mess With Optimizer Parameters: If an execution plan is not choosing an index, understand why. Use the tools available. The GATHER_PLAN_STATISTICS hint is a perfect place to start. Fiddling with Optimizer parameters is not the solution.

Oracle Optimizer Development Team Starts A Blog

Since the introduction of the Cost Based Optimizer (CBO) in Oracle 7.0, people have been both fascinated and terrified by it and the statistics that feed it. There has long been a belief that a degree in witchcraft or black magic is required to successfully work with the CBO. Some people feel this shroud of mystery is caused in part by the lack of documentation or detailed examples about the Optimizer. In order to help remove this shroud and the black magic that surround the Optimizer, the development team responsible for it have started a blog. The blog postings will give in-depth descriptions of the technology behind the CBO, as well as practical advice like best practices for using the Optimizer. Their first blog post, Why are there more cursors in 11g for my query containing bind variables? went up on December 3rd and it discusses the controversial topic of Bind Peeking and how it has been enhanced in 11g. I recommend you check out their blog at http://optimizermagic.blogspot.com.

Oracle 11g: Extended Statistics

In the Real-World Performance Roundtable, Part 2: The Optimizer, Schema Statistics, SQL Tuning at Oracle OpenWorld 2006, I worked an example of how the optimizer can have difficulty estimating the correct cardinality when there is data correlation. (The Zodiac example can be found on pages 46-49 of the presentation.) In Oracle 11g, there have been some enhancements to help the optimizer deal with data correlation.

DBMS_STATS.CREATE_EXTENDED_STATS

Previously I blogged about the 11g enhancement to DBMS_STATS.AUTO_SAMPLE_SIZE and the new algorithm for gathering NDV. One of the other enhancements to DBMS_STATS is the CREATE_EXTENDED_STATS function. It is this function that will allow us to tell the Optimizer that two or more columns have data that is correlated.

Zodiac Calendar Example

Let’s turn to the Zodiac calendar example to demonstrate where the functionality of DBMS_STATS.CREATE_EXTENDED_STATS can be applied. As you may know, there is a correlation between the Zodiac Sign and the calendar month. Below are the Zodiac signs and the corresponding days of the month.

  • Aries - March 21 – April 20
  • Taurus - April 21 – May 21
  • Gemini - May 22 – June 21
  • Cancer - June 22 – July 22
  • Leo - July 23 – August 21
  • Virgo - August 22 – September 23
  • Libra - September 24 – October 23
  • Scorpio - October 24 – November 22
  • Sagittarius - November 23 – December 22
  • Capricorn - December 23 – January 20
  • Aquarius - January 21 – February 19
  • Pisces - February 20 – March 20

For this test case I am going to load two tables, CALENDAR and PERSON. Below is a description of each.

SQL> desc calendar
 Name              Null?    Type
 ----------------- -------- ------------
 DATE_ID           NOT NULL NUMBER(8)
 MONTH             NOT NULL VARCHAR2(16)
 ZODIAC            NOT NULL VARCHAR2(16)

SQL> desc person
 Name              Null?    Type
 ----------------- -------- ------------
 PERSON_ID         NOT NULL NUMBER(10)
 DATE_ID           NOT NULL NUMBER(8)

The CALENDAR table has 365 rows, one row for every day of the calendar year. The PERSON table has 32,768 rows for each DATE_ID (each day of the year), for a total of 11,960,320 rows.

There are a few indexes I’m building on the tables:

  • Unique index on PERSON(PERSON_ID)
  • Unique index on CALENDAR(DATE_ID)
  • Non-Unique index on PERSON(DATE_ID)
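The original load scripts are not reproduced here, but below is a minimal sketch of DDL and data generation that matches the descriptions above. The index names (other than PERSON_N1, which shows up in the plans later) and the use of 2007 as the calendar year are assumptions on my part; the zodiac boundaries follow the list above.

CREATE TABLE calendar (
 date_id NUMBER(8)    NOT NULL,
 month   VARCHAR2(16) NOT NULL,
 zodiac  VARCHAR2(16) NOT NULL);

CREATE TABLE person (
 person_id NUMBER(10) NOT NULL,
 date_id   NUMBER(8)  NOT NULL);

-- one CALENDAR row per day of a non-leap year; zodiac derived from the date
INSERT INTO calendar (date_id, month, zodiac)
SELECT TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')),
       LOWER(TO_CHAR(d, 'fmMonth')),
       CASE
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '0321' AND '0420' THEN 'aries'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '0421' AND '0521' THEN 'taurus'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '0522' AND '0621' THEN 'gemini'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '0622' AND '0722' THEN 'cancer'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '0723' AND '0821' THEN 'leo'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '0822' AND '0923' THEN 'virgo'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '0924' AND '1023' THEN 'libra'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '1024' AND '1122' THEN 'scorpio'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '1123' AND '1222' THEN 'sagittarius'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '0121' AND '0219' THEN 'aquarius'
         WHEN TO_CHAR(d, 'MMDD') BETWEEN '0220' AND '0320' THEN 'pisces'
         ELSE 'capricorn'  -- December 23 through January 20 wraps the year end
       END
FROM  (SELECT DATE '2007-01-01' + LEVEL - 1 AS d
       FROM dual CONNECT BY LEVEL <= 365);

-- 32,768 PERSON rows per DATE_ID (365 x 32,768 = 11,960,320 rows)
INSERT /*+ append */ INTO person (person_id, date_id)
SELECT ROWNUM, c.date_id
FROM   calendar c,
       (SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 32768);
COMMIT;

CREATE UNIQUE INDEX person_u1   ON person (person_id);   -- assumed name
CREATE UNIQUE INDEX calendar_u1 ON calendar (date_id);   -- assumed name
CREATE INDEX        person_n1   ON person (date_id);     -- matches the plans below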

Now that the tables are loaded and the indexes created, it’s time to create the Extended Stats. Below is a portion of the documentation.


CREATE_EXTENDED_STATS Function

This function creates a column statistics entry in the system for a user specified column group or an expression in a table. Statistics for this extension will be gathered when user or auto statistics gathering job gathers statistics for the table. We call statistics for such an extension, “extended statistics”. This function returns the name of this newly created entry for the extension.

Syntax

DBMS_STATS.CREATE_EXTENDED_STATS (
   ownname    VARCHAR2,
   tabname    VARCHAR2,
   extension  VARCHAR2)
 RETURN VARCHAR2;

Parameters

Table 127-8 CREATE_EXTENDED_STATS Function Parameters

Parameter   Description
---------   -------------------------------------------------------------------
ownname     Owner name of a table
tabname     Name of the table
extension   Can be either a column group or an expression. Suppose the
            specified table has two columns, c1 and c2. An example column
            group can be “(c1, c2)” and an example expression can be
            “(c1 + c2)”.

Return Values
This function returns the name of this newly created entry for the extension.


Since there is a correlation between the MONTH and ZODIAC columns in the CALENDAR table, the column group for the extended statistics will be (MONTH, ZODIAC).

Here is the command to create the extended stats:
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(USER, 'CALENDAR', '(MONTH, ZODIAC)') FROM DUAL;

Now that we have the extended stats definition created, it’s time to gather stats. Here are the commands I’m using to gather stats:

BEGIN
 DBMS_STATS.GATHER_TABLE_STATS
 (
  OWNNAME => USER
 ,TABNAME => 'CALENDAR'
 ,ESTIMATE_PERCENT => NULL
 ,METHOD_OPT => 'FOR ALL COLUMNS SIZE SKEWONLY'
 );
END;
/
BEGIN
 DBMS_STATS.GATHER_TABLE_STATS
 (
  OWNNAME => USER
 ,TABNAME => 'PERSON'
 ,ESTIMATE_PERCENT => NULL
 );
END;
/

Let’s look at the column stats on the two tables:

SELECT
   TABLE_NAME,
   COLUMN_NAME,
   NUM_DISTINCT as NDV,
   NUM_BUCKETS,
   SAMPLE_SIZE,
   HISTOGRAM
FROM
   USER_TAB_COL_STATISTICS
ORDER BY 1,2;
TABLE_NAME COLUMN_NAME                    NDV      NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
---------- ------------------------------ -------- ----------- ----------- ---------------
CALENDAR   DATE_ID                             365         254         365 HEIGHT BALANCED
CALENDAR   MONTH                                12          12         365 FREQUENCY
CALENDAR   SYS_STUWHPY_ZSVI_W3#C$I3EUUYB4       24          24         365 FREQUENCY
CALENDAR   ZODIAC                               12          12         365 FREQUENCY
PERSON     DATE_ID                             365           1    11960320 NONE
PERSON     PERSON_ID                      11960320           1    11960320 NONE

As you can see, column statistics were gathered on the (MONTH, ZODIAC) column group of CALENDAR, represented by the SYS_STUWHPY_ZSVI_W3#C$I3EUUYB4 column.
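If you want to map that system-generated column name back to its definition, the 11g data dictionary exposes it through USER_STAT_EXTENSIONS (this query is just a sanity check I’d run, not part of the original test case):

SELECT EXTENSION_NAME, EXTENSION
FROM   USER_STAT_EXTENSIONS
WHERE  TABLE_NAME = 'CALENDAR';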

The Moment of Truth

Will the extended statistics be enough to give the optimizer the information it needs to estimate an accurate number of rows? Let’s test it by running three test cases:

  1. How many people have a birth month of May?
  2. How many people have a Zodiac sign of Taurus?
  3. How many people have a birth month of May and a Zodiac sign of Taurus?

Each query is run with a /*+ gather_plan_statistics */ hint followed by
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
The goal is for the E-Rows (Optimizer Estimated Rows) to be a statistically accurate estimate of the A-Rows (Actual Rows).

Below is the output from DBMS_XPLAN.DISPLAY_CURSOR for each of the three test cases.

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------
SQL_ID  55qv2rt3k8b3w, child number 0
-------------------------------------
select /*+ gather_plan_statistics */  count(*)
from  person p ,calendar c
where p.date_id = c.date_id and month = 'may'

Plan hash value: 1463406140

--------------------------------------------------------------------
| Id  | Operation           | Name      | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   1 |  SORT AGGREGATE     |           |      1 |      1 |      1 |
|   2 |   NESTED LOOPS      |           |      1 |   1015K|   1015K|
|*  3 |    TABLE ACCESS FULL| CALENDAR  |      1 |     31 |     31 |
|*  4 |    INDEX RANGE SCAN | PERSON_N1 |     31 |  32768 |   1015K|
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("MONTH"='may')
   4 - access("P"."DATE_ID"="C"."DATE_ID")



PLAN_TABLE_OUTPUT
--------------------------------------------------------------------
SQL_ID  8y54wtmy228r0, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*)
from  person p ,calendar c
where p.date_id = c.date_id and zodiac = 'taurus'

Plan hash value: 1463406140

--------------------------------------------------------------------
| Id  | Operation           | Name      | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   1 |  SORT AGGREGATE     |           |      1 |      1 |      1 |
|   2 |   NESTED LOOPS      |           |      1 |   1015K|   1015K|
|*  3 |    TABLE ACCESS FULL| CALENDAR  |      1 |     31 |     31 |
|*  4 |    INDEX RANGE SCAN | PERSON_N1 |     31 |  32768 |   1015K|
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ZODIAC"='taurus')
   4 - access("P"."DATE_ID"="C"."DATE_ID")


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------
SQL_ID  8ntkxs4ztb2rz, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*)
from  person p ,calendar c
where p.date_id = c.date_id and zodiac = 'taurus' and month = 'may'

Plan hash value: 1463406140

--------------------------------------------------------------------
| Id  | Operation           | Name      | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   1 |  SORT AGGREGATE     |           |      1 |      1 |      1 |
|   2 |   NESTED LOOPS      |           |      1 |    688K|    688K|
|*  3 |    TABLE ACCESS FULL| CALENDAR  |      1 |     21 |     21 |
|*  4 |    INDEX RANGE SCAN | PERSON_N1 |     21 |  32768 |    688K|
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(("ZODIAC"='taurus' AND "MONTH"='may'))
   4 - access("P"."DATE_ID"="C"."DATE_ID")

Summary

As demonstrated, adding Extended Statistics and using Histograms allowed the Optimizer to accurately estimate the number of rows, even when there was data correlation. This is a very useful enhancement to assist the Optimizer when there is known data correlation.