DBMS_STATS, METHOD_OPT and FOR ALL INDEXED COLUMNS

I’ve written before on choosing an optimal stats gathering strategy but I recently came across a scenario that I didn’t directly blog about and think it deserves attention. As I mentioned in that previous post, one should only deviate from the defaults when there is a reason to, and only with a full understanding of that reason and the effect of the decision.

Understanding METHOD_OPT

The METHOD_OPT parameter of DBMS_STATS controls two things:

  1. on which columns statistics will be collected
  2. on which columns histograms will be collected (and how many buckets)

It is very important to understand #1 and how the choice of METHOD_OPT affects the collection of column statistics.
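
To make those two roles concrete, here is a quick sketch of a few commonly seen METHOD_OPT values (T1 is just a placeholder table; these are illustrations, not recommendations):

-- 10g+ default: gather stats on every column; let Oracle decide which columns get histograms
exec dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all columns size auto');

-- gather stats on every column, but no histograms (a single bucket per column)
exec dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all columns size 1');

-- gather stats (and possibly histograms) only on columns that participate in an index
exec dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all indexed columns size auto');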

Prerequisite: Where Do I Find Column Statistics?

Understanding where to find column statistics is vital for troubleshooting bad execution plans. These views will be the arrows in your quiver:

  • USER_TAB_COL_STATISTICS
  • USER_PART_COL_STATISTICS
  • USER_SUBPART_COL_STATISTICS

Depending on whether the table is partitioned or subpartitioned, and on what GRANULARITY the stats were gathered with, the latter two of those views may or may not be populated.
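
As a quick check, here is a sketch of a query that lists columns with no column statistics at all (the table name is just an example):

select c.column_name
from   user_tab_columns c
       left outer join user_tab_col_statistics s
       on  s.table_name  = c.table_name
       and s.column_name = c.column_name
where  c.table_name = 'T1'
and    s.column_name is null;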

The Bane of METHOD_OPT: FOR ALL INDEXED COLUMNS

If you are using FOR ALL INDEXED COLUMNS as part of your METHOD_OPT you probably should not be. Allow me to explain. Using METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE AUTO' (a common METHOD_OPT I see) tells DBMS_STATS: “only gather stats on columns that participate in an index, and based on the data distribution and workload of those indexed columns, decide if a histogram should be created and how many buckets it should contain”. Is that really what you want? My guess is probably not. Let me work through a few examples to explain why.

I’m going to start with this table.

SQL> exec dbms_random.initialize(1);

PL/SQL procedure successfully completed.

SQL> create table t1
  2  as
  3  select
  4    column_value                    pk,
  5    round(dbms_random.value(1,2))   a,
  6    round(dbms_random.value(1,5))   b,
  7    round(dbms_random.value(1,10))  c,
  8    round(dbms_random.value(1,100)) d,
  9    round(dbms_random.value(1,100)) e
 10  from table(counter(1,1000000))
 11  /

Table created.
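
A quick aside: counter is not a built-in; it is a user-defined function that returns a collection of integers (a commenter asks about it below). A minimal sketch of one way to write it, as a pipelined table function (an assumption about the implementation), might look like this:

create type number_tab as table of number;
/

create or replace function counter (p_start in number, p_end in number)
  return number_tab pipelined
as
begin
  for i in p_start .. p_end loop
    pipe row (i);
  end loop;
  return;
end;
/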

SQL> begin
  2    dbms_stats.gather_table_stats(
  3      ownname => user ,
  4      tabname => 'T1' ,
  5      estimate_percent => 100 ,
  6      cascade => true);
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> select
  2    COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  3    to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  4  from user_tab_col_statistics
  5  where table_name='T1'
  6  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 18:39:51
A                      2 NONE                      1 2008-13-10 18:39:51
B                      5 NONE                      1 2008-13-10 18:39:51
C                     10 NONE                      1 2008-13-10 18:39:51
D                    100 NONE                      1 2008-13-10 18:39:51
E                    100 NONE                      1 2008-13-10 18:39:51

6 rows selected.

This 6 column table contains 1,000,000 rows of randomly generated numbers. I’ve queried USER_TAB_COL_STATISTICS to display some of the important attributes (NDV, Histogram, Number of Buckets, etc).

I’m going to now put an index on T1(PK), delete the stats and recollect stats using two different METHOD_OPT parameters that each use 'FOR ALL INDEXED COLUMNS'.

SQL> create unique index PK_T1 on T1(PK);

Index created.

SQL> begin
  2    dbms_stats.delete_table_stats(user,'T1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname => user ,
  6      tabname => 'T1' ,
  7      estimate_percent => 100 ,
  8      method_opt => 'for all indexed columns' ,
  9      cascade => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  2  to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 HEIGHT BALANCED          75 2008-13-10 18:41:10

SQL> begin
  2    dbms_stats.delete_table_stats(user,'T1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname => user ,
  6      tabname => 'T1' ,
  7      estimate_percent => 100 ,
  8      method_opt => 'for all indexed columns size auto' ,
  9      cascade => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  2  to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 18:41:12

Notice that in both cases only column PK has stats on it. Columns A, B, C, D and E do not have any stats collected on them. Also note that when no SIZE clause is specified, SIZE defaults to 75 buckets.

Now one might think that is no big deal, or perhaps not even realize it is happening because the stats are never examined. Let’s see what the Optimizer gives us for cardinality estimates in a few scenarios.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  /

  COUNT(*)
----------
    500227

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  4df0g0r99zmba, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.24 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |  10000 |    500K|00:00:00.50 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"=1)

Notice the E-Rows estimate for T1. The Optimizer is estimating 10,000 rows when in reality there are 500,227. The estimate is off by more than an order of magnitude (50x). Normally the cardinality calculation (for a single-table, single equality predicate) would be:
number of rows in T1 * 1/NDV = 1,000,000 * 1/2 = 500,000
but in this case 10,000 is the estimate. Strangely enough (or not), 10,000 is exactly 0.01 (1%) of 1,000,000. Because there are no column stats for T1.A, the Optimizer is forced to make a guess, and that guess is 1%.

As you can see from the 10053 trace (below), since there are no statistics on the column, defaults are used. In this case they yield very poor cardinality estimations.

SINGLE TABLE ACCESS PATH
  -----------------------------------------
  BEGIN Single Table Cardinality Estimation
  -----------------------------------------
  Column (#2): A(NUMBER)  NO STATISTICS (using defaults)
    AvgLen: 13.00 NDV: 31250 Nulls: 0 Density: 3.2000e-05
  Table: T1  Alias: T1
    Card: Original: 1000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
  -----------------------------------------
  END   Single Table Cardinality Estimation
  -----------------------------------------

Now that I’ve demonstrated how poor the cardinality estimation was with a single equality predicate, let’s see what two equality predicates gives us for a cardinality estimate.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6  /

  COUNT(*)
----------
    124724

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  ctq8q59qdymw6, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.19 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    100 |    124K|00:00:00.25 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("A"=1 AND "B"=3))

Yikes. In this case the cardinality estimate is 100 when the actual number of rows is 124,724, a difference of over 3 orders of magnitude (over 1000x). Where did the 100 row estimate come from? In this case there are two equality predicates so the selectivity is calculated as 1% * 1% or 0.01 * 0.01 = 0.0001. 1,000,000 * 0.0001 = 100. Funny that. (The 1% is the default selectivity for an equality predicate w/o stats.)

Now let’s add a derived predicate as well and check the estimates.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6    and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3
 and d+e > 50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.22 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |      5 |    109K|00:00:00.33 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("A"=1 AND "B"=3 AND "D"+"E">50))

Doh! The cardinality estimate is now 5, but the actual number of rows being returned is 109,816. Not good at all. The Optimizer estimated 5 rows because it used a default selectivity of 1% (for A=1) * 1% (for B=3) * 5% (for D+E > 50) * 1,000,000 rows. Now can you see why column statistics are very important? All it takes is a few predicates and the cardinality estimation becomes very small, very fast. Now consider this:

  • What is likely to happen in a data warehouse where the queries are 5+ table joins and the fact table columns do not have indexes?
  • Would the Optimizer choose the correct driving table?
  • Would nested loops plans be chosen when they are really not appropriate?

Hopefully you can see where this is going. If you don’t, here is the all too common chain of events:

  • Non representative (or missing) statistics lead to
  • Poor cardinality estimates which leads to
  • Poor access path selection which leads to
  • Poor join method selection which leads to
  • Poor join order selection which leads to
  • Poor SQL execution times

Take 2: Using the Defaults

Now I’m going to recollect stats with a default METHOD_OPT and run through the 3 execution plans again:

SQL> begin
  2    dbms_stats.delete_table_stats(user,'t1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname => user ,
  6      tabname => 'T1' ,
  7      estimate_percent => 100 ,
  8      degree => 8,
  9      cascade => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, histogram, NUM_BUCKETS,
  2  to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics where table_name='T1'
  4  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 19:44:32
A                      2 FREQUENCY                 2 2008-13-10 19:44:32
B                      5 FREQUENCY                 5 2008-13-10 19:44:32
C                     10 FREQUENCY                10 2008-13-10 19:44:32
D                    100 NONE                      1 2008-13-10 19:44:32
E                    100 NONE                      1 2008-13-10 19:44:32

6 rows selected.
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  /

  COUNT(*)
----------
    500227

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  4df0g0r99zmba, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.20 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    500K|    500K|00:00:00.50 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"=1)
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6  /

  COUNT(*)
----------
    124724

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  ctq8q59qdymw6, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.14 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    124K|    124K|00:00:00.25 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1))
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6    and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3
 and d+e>50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.17 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |   6236 |    109K|00:00:00.22 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1 AND "D"+"E">50))

As you can see, the first two queries have spot-on cardinality estimates, but the third query isn’t as good because it filters on an expression (D+E) and there are no stats on that expression, only on D and E individually. I’m going to rerun the third query with dynamic sampling set to level 4 (in 10g it defaults to 2) and reevaluate the cardinality estimate.

SQL> alter session set optimizer_dynamic_sampling=4;

Session altered.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6    and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 1
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3
 and d+e > 50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.17 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    102K|    109K|00:00:00.22 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1 AND "D"+"E">50))

Note
-----
   - dynamic sampling used for this statement

Bingo! Close enough to call statistically equivalent.
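
As an aside, the same effect could be scoped to a single statement with the DYNAMIC_SAMPLING hint instead of a session parameter; a sketch (not run as part of this test):

select /*+ gather_plan_statistics dynamic_sampling(t1 4) */
  count(*)
from t1
where a=1
  and b=3
  and d+e > 50
/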

Summary

I hope this little exercise demonstrates how important it is to have representative statistics and that when statistics are representative the Optimizer can very often accurately estimate the cardinality and thus choose the best plan for the query. Remember these points:

  • Recent statistics do not necessarily equate to representative statistics.
  • Statistics are required on all columns to yield good plans – not just indexed columns.
  • You probably should not be using METHOD_OPT => 'FOR ALL INDEXED COLUMNS SIZE AUTO', especially in a data warehouse where indexes are used sparingly.
  • Dynamic Sampling can assist with cardinality estimates where existing stats are not enough.

Tests performed on 10.2.0.4

48 comments

  1. Mladen Gogala

    I have never seen anybody using ‘FOR ALL INDEXED COLUMNS SIZE AUTO’ but I do use ‘FOR ALL INDEXED COLUMNS SIZE 251′ on a regular basis. Precise histograms are needed in order for the CBO to be able to select a correct access path. If the selected access path is not correct, for whatever reason, there are hints to fix it in 0.001% of the cases. The columns that really do need histograms, where it really does matter, are the indexed columns. Those are the only columns where the optimizer can choose the access path. Histograms will take quite a bit of space in the SYSAUX tablespace, and that space was the reason not to collect histograms for all columns in the old days of 8i. Now, 1TB disks are commonplace and 0.5TB disks are getting into SAN configurations, so there is no need for maniacal space saving exercises. BTW, I use the number 251 because it is a prime number. I do agree with your “size auto” suggestion; I don’t really trust Oracle to “figure out” my statistics, based on an unpublished algorithm, probably using the Force. I do want good histograms on my indexed columns, even if it means that the unique indexes, for which histograms do not make sense, will also be processed.

  2. Todor Botev

    Thanks for the post! It gave me a very good insight.

    Can non-representative (or unavailable) histograms have a similar effect?

  3. Greg Rahn

    @Virag
    This was on 10.2.0.4. I’ve updated the post to note it as well.

    @Mladen
    Interesting comment that you “don’t trust Oracle to figure out your statistics” because the algorithm is unpublished, yet you seem to run Oracle software and it is not open source. Seems a bit contradictory to me. Also seems that you use “the Force” a bit yourself: choosing 251 for the bucket count just because it is a prime number? 254 is a computer number (power of 2) so wouldn’t that make just as much sense or more? It would be nice to have some more details of how your strategy is better than the default options based on the number of better/same/worse plans you have seen for your database/workload.

    @Todor
    By similar effect do you mean very small estimated cardinality?

  4. jason arneil

    I’m expecting I’m gonna end up with egg on my face here:

    “254 is a computer number (power of 2)”

    Are you sure you are not confusing that with 256 – which is 2^8

    Not that this is exactly relevant to the above.

    One thing on your examples though, am I right in saying that BOTH the accurate estimated number of rows and the inaccurate estimated number of rows produce the SAME execution plan?

    jason.

  5. Mladen Gogala

    Greg, I am trying to keep on top and understand the underlying mechanism. I’ve seen cases where “SIZE AUTO” produces only 2 buckets, minimum and maximum, which is, of course, useless. By forcing 251 buckets, I am less likely to encounter unpleasant surprises with my histograms. One more thing: 254 is NOT a power of 2. The prime number is just a strong personal preference, not really based on any rigorous logical process. I’ve never had any problems with that number.

  6. Greg Rahn

    @Jason/Mladen
    Argh…Sometimes the brain lags the keyboard…I was thinking 256 (power of 2), but in reality it is 254 which != 2^8. I guess I should have used something like 254 in hex is FE which is the periodic symbol for iron. Anyway, the point being I find little relevance in “magic” numbers, especially unproven and untested ones.

    @Mladen
    I guess if it works for you, then great. Personally I would never force any fixed collection of histograms and certainly not based on any “magic” number. Especially when it appears you have no comparison point to determine better or worse. I find it interesting that many people do not want histograms and you are forcing them everywhere. How many databases are you managing this way, how large are they, and what kind of workloads run on them?

    @Jason
    In the examples I chose there is only 1 table and no indexes, so there is only 1 possible plan. I’m using that example to demonstrate just the cardinality estimates in their most simple form. Hopefully it can be understood that if one had a two table join and there was a poor cardinality estimate, the incorrect driving table could be chosen, or it may do a NL join when you want a HASH join (usually a big problem on big data sets). As you add more tables, the number of possible plans increases, so having bad (or no) stats gives more opportunity for things to go wrong. Maybe I’ll put some multi-table examples together.

    Consider representative stats (and accurate cardinality estimates) the foundation for good execution plans.

  7. Chris Adkin

    Going off at a slight tangent here, but what is your preferred method for handling columns with highly skewed values for which there are more than 254 distinct values? Dynamic sampling, or the DBMS_STATS API which allows you to get end points for the values you are really interested in?

    Chris

  8. Greg Rahn

    @Chris

    Histograms can be used when there are more than 254 distinct values, it just means a height balanced histogram is used instead of a frequency histogram. Dynamic Sampling can also assist in getting accurate cardinality estimates, but it is best used for longer running (read minutes) queries in a BI/DW/DSS database vs. OLTP queries. I would discourage using the DBMS_STATS set routines if you can use one of the other methods. It could get messy and there really isn’t enough trace information available to debug it if it goes bad.

  9. Wolfgang Breitling

    In response to Mladen:
    “The columns that really do need histograms, where it really does matter, are the indexed columns. Those are the only columns where the optimizer can choose the access path.”
    That is a common misconception: that histograms only matter for indexed columns. Choosing an access method (index or tablescan) for a row source is only part of the CBO’s task when putting together an access plan. Other, at least equally important, tasks are the types and especially the order of joins – unless none of your SQL contains joins. And it is there that column selectivities, derived from column statistics, play a crucial role. If you deprive the optimizer of this vital information it is bound to produce suboptimal, even very bad, plans. If you have not yet encountered that, count yourself lucky.

  10. Kumar

    Hi Greg:
    With the option to gather histograms on skewed columns available, would it be safe to collect histograms on all the columns using this option?

    Thank you
    - Kumar

  11. Greg Rahn

    @Kumar
    I’m not sure if I understand your question. Are you asking if it is safe to use METHOD_OPT=>’FOR ALL COLUMNS SKEWONLY’? What is the exact dbms_stats command you would be using or comparing?

  12. Kumar

    Hi Greg
    So it would be something like this

    begin
    dbms_stats.gather_table_stats(
    ownname => ‘PA’ ,
    tabname => ‘PA_EXPENDITURE_TYPES’ ,
    estimate_percent => 100 ,
    method_opt => ’FOR ALL COLUMNS SKEWONLY size AUTO’,
    degree => 8,
    cascade => true);
    end;

  13. Greg Rahn

    @Kumar
    I’m not sure what exactly “safe” means, but this command will look at the data for each and every column and if it determines there is enough skew to warrant a histogram it will create one. I would also comment that a 100% sample is quite large and ask if it is really necessary. I would suggest to start with dbms_stats.auto_sample_size as a starting point.
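
    For reference, a sketch of what that call might look like with AUTO_SAMPLE_SIZE (owner/table names taken from your comment; the placement of SKEWONLY as the SIZE argument is my assumption of the intended syntax):

    begin
      dbms_stats.gather_table_stats(
        ownname          => 'PA',
        tabname          => 'PA_EXPENDITURE_TYPES',
        estimate_percent => dbms_stats.auto_sample_size,
        method_opt       => 'for all columns size skewonly',
        degree           => 8,
        cascade          => true);
    end;
    /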

    Are you seeing plans in which the cardinality estimates are not representative and you are attributing that to data skew? After looking at the data you do see skew in the distribution of values?

  14. Kumar

    Hi Greg:
    From what I understood, we should not create histograms on all the columns and it can be detrimental in some cases. From that aspect I was asking if it is safe to use this skew-only option. Yes, I think based on the data distribution on some tables that the query uses, there is data skew and histograms will probably help make the right decision. So I have this query which was taking a long time to complete according to developers and I put the hint to do a FTS and it completes in 22 minutes as opposed to 2.5 hrs. I want to remove the hint and tune it the right way and was thinking about creating histograms based on the skewness.
    One other option that I was thinking of was to use dynamic sampling as discussed in another blog of yours.

    Thank you
    - kumar

  15. Greg Rahn

    @Kumar

    So I have this query which was taking a long time to complete according to developers and I put the hint to do a FTS and it completes in 22 minutes as opposed to 2.5 hrs. I want to remove the hint and tune it the right way and was thinking about creating histograms based on the skewness.

    With the information provided thus far, I’m not convinced that a histogram is the solution as I am unclear where the problem specifically lies. Do you have evidence that a predicate column (join or filter) has significant skew and the cardinality estimate is too low in this case? Have you simplified this down to a specific table and column? If not, that is ok, but we’ll need to get there to correctly solve this issue.

  16. John

    How does using “FOR ALL COLUMNS SIZE 1″ compare to not using any method_opt?

  17. Greg Rahn

    @John
    In 9i METHOD_OPT defaults to ‘FOR ALL COLUMNS SIZE 1′.
    In 10g and newer METHOD_OPT defaults to ‘FOR ALL COLUMNS SIZE AUTO’.
    When you don’t explicitly cite it in the DBMS_STATS call, it defaults to one of the above.
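
    If in doubt, the active default can be checked directly; a quick sketch (GET_PARAM is the 10g interface, GET_PREFS is its 11g replacement):

    -- 10g
    select dbms_stats.get_param('METHOD_OPT') from dual;
    -- 11g and newer
    select dbms_stats.get_prefs('METHOD_OPT') from dual;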

  18. devinder

    Somehow I don’t agree with you. Why would you run a select against a table with millions of rows and no index on the column in the where clause? It’s going to do a full table scan anyway, whether histograms are available or not. Histograms are not cheap to come by, at least up until 10g. Wouldn’t you prefer to have histograms on indexed columns, so that you know the skewness of the data in the indexed columns and thus help the optimizer generate an optimal plan?
    The option method_opt=>’for all columns size whatever’ not only takes a long time to generate stats on the columns, it also generates sub-optimal plans, and there are many articles on Metalink which suggest not to use this option.

  19. Greg Rahn

    @devinder
    What exactly don’t you agree with?

    You should recognize the examples are not about full table scan or not. They are about getting accurate cardinality estimates. I prefer to have a histogram on a skewed column regardless of whether it participates in an index or not. See, indexes are used as an access method, not a way to calculate cardinality (other than a couple of exceptions).

    There are many, many data warehouses that run queries against tables with millions or billions of rows without indexes on the predicate columns, and possibly many of those unindexed columns have histograms if skew is present. Recognize this discussion is not about gathering histograms (or not) on columns which are indexed. It is about gathering column statistics for columns that do not participate in indexes. Not to be too critical, but 10g has been out for over 5 years now, and 11g for over 1 year, so I’m not sure how much weight a pre-10g argument carries these days, especially when issues have been fixed.

    If you would be so kind as to cite what Metalink notes make such suggestions, it would be appreciated.

  20. Henry

    While referring back to this article for the umpteenth time, I just noticed one minor(?) discrepancy. I actually noticed it because I was trying to set up my own test case. The first time you gather stats (method_opt is the default), there are no histograms created. When you gather stats at the beginning of ‘Take 2′, using the same parameters (except for degree=>8 and the unique index already in place, neither of which make a difference), you see histograms.

    What’s up with that?

    I was looking at Adaptive Cursor Sharing with Dynamic Sampling and was wondering if having method_opt automatically detect skew might be related to dynamic sampling detecting skew, which might affect ACS (when I had no histograms with the default method_opt, dynamic sampling (level 4) didn’t take advantage of ACS).

  21. Greg Rahn

    @Henry

    The reason that histograms were gathered in the “Take 2″ version and not in the first collection is that sys.col_usage$ has been populated because I ran 3 queries against the table T1. In the first collection, no queries had been run against T1 so it was impossible for “size auto” to collect any histograms as sys.col_usage$ contained no rows for T1. This is something to be very mindful of; if you are using the default method_opt, the predicates in the queries executed against the tables can trigger new histograms to be collected.

    You can use this query to look at sys.col_usage$

    select	oo.name owner,
    	o.name table_name,
    	c.name column_name,
    	u.equality_preds,
    	u.equijoin_preds,
    	u.nonequijoin_preds,
    	u.range_preds,
    	u.like_preds,
    	u.null_preds,
    	u.timestamp
    from	sys.col_usage$ u,
    	sys.obj$ o,
    	sys.user$ oo,
    	sys.col$ c
    where	o.obj#   = u.obj#
    and	oo.user# = o.owner#
    and	c.obj#   = u.obj#
    and	c.col#   = u.intcol#;
    

    If there are no histograms then Adaptive Cursor Sharing will not really change anything. Adaptive Cursor Sharing was designed for the situation where you have a statement with bind variables and histograms exist on the table. Adaptive Cursor Sharing is designed to give a new child cursor and plan if the combination of the histogram and peeked predicate value would give a new plan had it been a literal value.

  22. Henry

    Thanks for the reminder on col_usage$, I forgot all about that.

    I understand that if a new value for a bind variable leads to degraded performance, ACS will (after the second run) lead to a new child cursor and potentially new execution plan. A better plan would be possible only with a histogram present.

    I was wondering about dynamic sampling. A new bind value will still degrade performance, which can lead to a new parse/plan via ACS. Even with no histograms (no stats) dynamic sampling could yield a new plan as the skew can be determined. My initial tests show this doesn’t happen and dynamic sampling doesn’t yield a new plan, even with skewed data, using ACS. Testing is still incomplete.

  23. Pingback: 2009 Year-End Zeitgeist
  24. DJ

    @Greg. I am in a situation where my table CF has 500 million records, with daily inserts and updates of 6 million records on it.
    We have to do a daily stats gather on this table, but we also have to consider how much time it takes, otherwise it will increase our batch window size.

    There are two things which we tried

    begin
    dbms_stats.gather_table_stats(OWNNAME=>’XYZ’,
    TABNAME=>’CF’,
    METHOD_OPT=>’FOR ALL INDEXED COLUMNS SIZE 1′,
    ESTIMATE_PERCENT=>25,
    DEGREE =>8,
    CASCADE =>TRUE );
    end;
    This is taking 2.5 hrs
    begin
    dbms_stats.gather_table_stats( OWNNAME =>’EXAMIN_BI’,TABNAME =>’CHARGE_FACT’,ESTIMATE_PERCENT => 25 ,CASCADE => true,DEGREE =>8)
    end;
    This is taking 4.5 hrs.

    Not sure what we should do. There are many reports which are hitting this fact table. We have almost 10 composite indexes on this table.

    Kindly share your expertise.

  25. Greg Rahn

    @DJ
    First, why use 25% sample? Why not 5%? Why not 1%? Is that just a “magic” number you chose?

    What is CPU_COUNT for your system? Why use DEGREE 8 and not something larger?

    10 indexes on a fact table seems like the wrong approach to me. Why so many indexes? Is the number of columns in all the indexes larger than the number of columns in the table?

  26. DJ

    @Greg Rahn
    Thanks for your response Greg.
    @Greg we are working in a shared services environment. This is the cpu count provided to us by dba services.
    cpu_count:  64
    The reason we are using ESTIMATE_PERCENT as 25% and DEGREE=>8 is that it is the standard the shared services DBAs recommend for gathering stats on any table.

    Out of the 10 indexes, 4 are used by our ETL and the rest are required by reporting users.
    And actually I didn’t get the last question.

    But I definitely understand the use of degree and am in talks with the dba’s to increase it for our job.

    I also want to understand the METHOD_OPT default vs. ’FOR ALL INDEXED COLUMNS SIZE 1′.

  27. Pingback: Gathering statistics with DBMS_STATS procedures « Mike Desouza's Blog
  28. Pingback: How To Gather Table Stats Effectively | Rants & Raves – The Blog!
  29. Dominic Brooks

    Hi Greg,

    I know this is an old article and I’ve read it a number of times but I’ve not noticed this comment before.

    The advice that dynamic sampling is not used for bind variables is just wrong.

    SQL> drop table t1;
    
    Table dropped.
    
    SQL> create table t1
      2  as
      3  select case when rownum = 1 then 'N' else 'Y' end flagcol
      4  from   dual
      5  connect by rownum <=50000;
    
    Table created.
    
    SQL>  exec dbms_stats.delete_table_stats(USER,'T1');
    
    PL/SQL procedure successfully completed.
    
    SQL> var x varchar2(1)
    SQL> exec :x := 'Y'
    
    PL/SQL procedure successfully completed.
    
    SQL> select /*+ gather_plan_statistics */
      2         count(*)
      3  from   t1
      4  where  flagcol = :x;
    
      COUNT(*)
    ----------
         49999
    
    SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
    
    PLAN_TABLE_OUTPUT
    ----------------------------------------------------------------------------------------------
    SQL_ID  49778jqxqcfpq, child number 0
    -------------------------------------
    select /*+ gather_plan_statistics */        count(*) from   t1 where
    flagcol = :x
    
    Plan hash value: 3724264953
    
    ----------------------------------------------------------------------------------------------
    | Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
    ----------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.05 |      81 |      6 |
    |   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.05 |      81 |      6 |
    |*  2 |   TABLE ACCESS FULL| T1   |      1 |  52391 |  49999 |00:00:00.05 |      81 |      6 |
    ----------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - filter("FLAGCOL"=:X)
    
    Note
    -----
       - dynamic sampling used for this statement (level=4)
    
    
    24 rows selected.
    
    SQL> 
    

    Cheers,
    Dominic

  30. Greg Rahn

    @Dominic

    You are correct, dynamic sampling can be used with statements that contain bind variables (I’ll fix that comment – don’t recall exactly why I wrote that), however I think that OTN thread suggested that people had statements whose plans showed dynamic sampling for a literal value, but when changed to a bind variable dynamic sampling was not chosen, but there isn’t any restriction as such. Obviously if the table has no stats (as in your test case) or you use an explicit hint, dynamic sampling will kick in, even with bind variables.

  31. Kishore

    Hi Greg,
    Nice post.
    I have a table with 8 columns and one of the columns is of BLOB datatype. I would like to exclude that column from the statistics gathering, as that column is hardly used, and it takes a very long time to gather statistics. This table has about 600 mil records.
    I tried with Method_opt => ‘For all columns size auto for columns (col1,col2,col3,…)’
    But it is giving me an error that col1 is an invalid identifier.

    Any idea, how i can exclude that.

    Thanks.

    Kishore

  32. peter

    Hi Greg,

    In your example, you create a table with 1 million records using the following construct:

    “table(counter(1,1000000))”

    What exactly is this? I couldn’t find a reference of this anywhere and I couldn’t get it to run in SQL*Plus.

    thanks
    - peter

  33. GP

    Dynamic sampling is related to bind variables? Bind variable use simply lowers the accuracy of cost based optimization. If dynamic sampling is not used when bind variables are used, then it reverts to rule based optimization, does it not? Dynamic sampling simply looks at a sample of data when executing; using literals gives the query a better picture of where to look for data. If not using dynamic sampling, what does it use? The existing statistics? That would still be CBO, which is why it doesn’t make sense that dynamic sampling can’t be used. Are you saying that dynamic sampling requires literals to be invoked? Why can’t dynamic sampling be invoked just to take a quick/small snapshot of data distribution, regardless of bind variables or not? What happens with cursor_sharing=FORCE?

  34. GP

    dynamic sampling not used for bind variables is incorrect? I am inclined to agree. It doesn’t make any sense.

  35. GP

    Why bother with cardinality when full table scanning, unless histograms store physical addresses and allow a query to determine which rows are read based on physical addresses stored in histogram buckets? Cardinality helps indexing, not full scans. Full scans read disk in contiguous blocks in physical order rather than random access, so why would cardinality be an issue? And warehouses without indexes are unindexed because they read all the data anyway and it’s faster to full scan the tables rather than read index + tables. In my experience, the only reason to index a warehouse is PKs on dimensions (could be uniques maybe) and non-constrained indexes on FK inversions in facts, all for joining, and sometimes bitmaps on fact FKs (not really FKs) can help depending on data volatility. If there are no dimensions and/or there are multiple facts and joins between them, then the only thing that generally helps is partitioning, under the assumption that data is read in full tables or full partitions. However, the next time I build a warehouse I will experiment with histograms on non-keyed columns just to see what they do with optimized WHERE clauses. I do know for a fact that index searches on histogrammed indexed columns can make huge differences, but again, my overall strategy with a warehouse is often to remove all indexing and use partition keys (usually).

  36. Greg Rahn

    Have you actually read my prior response? The part where I say:”You should recognize the examples are not about full table scan or not.”

  37. Khurram

    Hi Greg,

    How did you know the degree of uneven distribution? I mean you used degree=>8; how come you knew the skewness for which you used degree=>8? Is it documented by Oracle what degree we need to use for what sort of data? It’s not clear to me why degree=>8 and not 9 or 10 … n?

    Khurram

  38. Khurram

    Greg, great article. One thing I intend to clear up about something you said:

    “Obviously if the table has no stats (as in your test case) or you use an explicit hint, dynamic sampling will kick in, even with bind variables.”

    Dynamic sampling will kick in whether you have stats or not.

    SQL> create table regions2
      2  (region_id  varchar2(10) primary key,
      3   region_name varchar2(25))
      4  /
    
    Table created.
    
    SQL> insert into regions2 values ('1','A')
      2  /
    
    1 row created.
    
    SQL> insert into regions2 values (2,'B')
      2  /
    
    1 row created.
    
    SQL> insert into regions2 values ('3','C')
      2  /
    
    1 row created.
    
    SQL> insert into regions2 values ('4','D')
      2  /
    
    1 row created.
    
    SQL> commit
      2  /
    
    Commit complete.
    
    SQL> select * from regions2
      2  /
    
    REGION_ID  REGION_NAME
    ---------- -------------------------
    1          A
    2          B
    3          C
    4          D
    
    SQL> variable regid number
    SQL> exec :regid:=1
    
    PL/SQL procedure successfully completed.
    
    SQL> select /*+ gather_plan_statistics */ *
      2    from regions2
      3   where region_id=:regid
      4  /
    
    REGION_ID  REGION_NAME
    ---------- -------------------------
    1          A
    
    SQL> SELECT * FROM table(dbms_xplan.display_cursor(NULL,NULL,'ALLSTATS LAST'))
      2  /
    
    PLAN_TABLE_OUTPUT
    ----------------------------------------------------------------------------------------------------
    SQL_ID  27wd2gm8ruukh, child number 0
    -------------------------------------
    select /*+ gather_plan_statistics */ *   from regions2   where region_id=:regid
    
    Plan hash value: 670750275
    
    ----------------------------------------------------------------------------------------
    | Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
    ----------------------------------------------------------------------------------------
    |*  1 |  TABLE ACCESS FULL| REGIONS2 |      1 |      1 |      1 |00:00:00.01 |      16 |
    ----------------------------------------------------------------------------------------
    
    PLAN_TABLE_OUTPUT
    ----------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       1 - filter(TO_NUMBER("REGION_ID")=:REGID)
    
    Note
    -----
       - dynamic sampling used for this statement
    
    
    21 rows selected.
    
    SQL> exec dbms_stats.gather_table_stats('APPS','REGIONS2',cascade=>true,method_opt=>'for all columns size 254')
    
    PL/SQL procedure successfully completed.
    
    SQL> select /*+ gather_plan_statistics */ *
      2    from regions2
      3    where region_id=:regid
      4  /
    
    REGION_ID  REGION_NAME
    ---------- -------------------------
    1          A
    
    SQL> SELECT * FROM table(dbms_xplan.display_cursor(NULL,NULL,'ALLSTATS LAST'))
      2  /
    
    PLAN_TABLE_OUTPUT
    ----------------------------------------------------------------------------------------------------
    SQL_ID  27wd2gm8ruukh, child number 0
    -------------------------------------
    select /*+ gather_plan_statistics */ *   from regions2   where region_id=:regid
    
    Plan hash value: 670750275
    
    ----------------------------------------------------------------------------------------
    | Id  | Operation         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
    ----------------------------------------------------------------------------------------
    |*  1 |  TABLE ACCESS FULL| REGIONS2 |      1 |      1 |      1 |00:00:00.01 |      16 |
    ----------------------------------------------------------------------------------------
    
    PLAN_TABLE_OUTPUT
    ----------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       1 - filter(TO_NUMBER("REGION_ID")=:REGID)
    
    Note
    -----
       - dynamic sampling used for this statement
    
    
    21 rows selected.
    

    Khurram

  39. Greg Rahn

    Do note, you are taking that comment out of context. It is 100% correct as stated, but it is not meant to cover 100% of the use cases.
    Notice that I say if there are no stats, then dynamic sampling is guaranteed to kick in (given default parameters and it meets the criteria), however, I do not say dynamic sampling won’t kick in if the table does have stats. Stating the positive does not negate the negative.
    It’s all in the documentation: http://docs.oracle.com/cd/E11882_01/server.112/e16638/stats.htm#CHDHAFBJ

    In your case dynamic sampling kicks in no matter what because you have a type conversion: the bind variable is a number, but the column is a string.

  40. JOHN

    Hi,
    Guys, what do you mean by the SKEW option in the method_opt parameter? And what should I do to gather statistics in the minimum possible time? I have a table which has 12 lacs (1.2 million) records, so what should I do?
