
DBMS_STATS, METHOD_OPT and FOR ALL INDEXED COLUMNS

I’ve written before on choosing an optimal stats gathering strategy, but I recently came across a scenario that I didn’t directly blog about and that I think deserves attention. As I mentioned in that previous post, one should only deviate from the defaults when there is a reason to do so, and only with a full understanding of that reason and the effect of the decision.

Understanding METHOD_OPT

The METHOD_OPT parameter of DBMS_STATS controls two things:

  1. on which columns statistics will be collected
  2. on which columns histograms will be collected (and how many buckets)

It is very important to understand #1 and how the choice of METHOD_OPT affects the collection of column statistics.
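For comparison, the out-of-the-box METHOD_OPT in 10g and 11g is 'FOR ALL COLUMNS SIZE AUTO'. A minimal sketch of spelling that out explicitly (FOO is just a placeholder table name):

begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'FOO',                         -- placeholder table name
    method_opt => 'for all columns size auto'    -- the 10g/11g default
  );
end;
/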

Prerequisite: Where Do I Find Column Statistics?

Understanding where to find column statistics is vital for troubleshooting bad execution plans. These views will be the arrows in your quiver:

  • USER_TAB_COL_STATISTICS
  • USER_PART_COL_STATISTICS
  • USER_SUBPART_COL_STATISTICS

Depending on whether the table is partitioned or subpartitioned, and on what GRANULARITY the stats were gathered with, the latter two views may or may not be populated.
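For example, for a partitioned table the partition-level view adds a PARTITION_NAME column; a minimal sketch of a query against it (FOO is just a placeholder):

select partition_name, column_name, num_distinct, histogram, num_buckets
from   user_part_col_statistics
where  table_name = 'FOO'
order  by partition_name, column_name;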

The Bane of METHOD_OPT: FOR ALL INDEXED COLUMNS

If you are using FOR ALL INDEXED COLUMNS as part of your METHOD_OPT you probably should not be. Allow me to explain. Using METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE AUTO' (a common METHOD_OPT I see) tells DBMS_STATS: “only gather stats on columns that participate in an index, and based on the data distribution and workload of those indexed columns, decide if a histogram should be created and how many buckets it should contain“. Is that really what you want? My guess is probably not. Let me work through a few examples to explain why.

I’m going to start with this table.

SQL> exec dbms_random.initialize(1);

PL/SQL procedure successfully completed.

SQL> create table t1
  2  as
  3  select
  4    column_value                    pk,
  5    round(dbms_random.value(1,2))   a,
  6    round(dbms_random.value(1,5))   b,
  7    round(dbms_random.value(1,10))  c,
  8    round(dbms_random.value(1,100)) d,
  9    round(dbms_random.value(1,100)) e
 10  from table(counter(1,1000000))
 11  /

Table created.

SQL> begin
  2    dbms_stats.gather_table_stats(
  3      ownname => user ,
  4      tabname => 'T1' ,
  5      estimate_percent => 100 ,
  6      cascade => true);
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> select
  2    COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  3    to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  4  from user_tab_col_statistics
  5  where table_name='T1'
  6  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 18:39:51
A                      2 NONE                      1 2008-13-10 18:39:51
B                      5 NONE                      1 2008-13-10 18:39:51
C                     10 NONE                      1 2008-13-10 18:39:51
D                    100 NONE                      1 2008-13-10 18:39:51
E                    100 NONE                      1 2008-13-10 18:39:51

6 rows selected.

This 6 column table contains 1,000,000 rows of randomly generated numbers. I’ve queried USER_TAB_COL_STATISTICS to display some of the important attributes (NDV, Histogram, Number of Buckets, etc).
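One note on reproducing this: counter() is a user-defined pipelined row generator, not something that ships with Oracle. If you want to build the same table without it, a standard CONNECT BY row generator (my substitution, not the original function) produces an equivalent table:

create table t1
as
select
  level                           pk,
  round(dbms_random.value(1,2))   a,
  round(dbms_random.value(1,5))   b,
  round(dbms_random.value(1,10))  c,
  round(dbms_random.value(1,100)) d,
  round(dbms_random.value(1,100)) e
from dual
connect by level <= 1000000
/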

I’m now going to put an index on T1(PK), delete the stats, and regather them using two different METHOD_OPT parameters that each use 'FOR ALL INDEXED COLUMNS'.

SQL> create unique index PK_T1 on T1(PK);

Index created.

SQL> begin
  2    dbms_stats.delete_table_stats(user,'T1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname => user ,
  6      tabname => 'T1' ,
  7      estimate_percent => 100 ,
  8      method_opt => 'for all indexed columns' ,
  9      cascade => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  2  to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 HEIGHT BALANCED          75 2008-13-10 18:41:10

SQL> begin
  2    dbms_stats.delete_table_stats(user,'T1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname => user ,
  6      tabname => 'T1' ,
  7      estimate_percent => 100 ,
  8      method_opt => 'for all indexed columns size auto' ,
  9      cascade => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select COLUMN_NAME, NUM_DISTINCT, HISTOGRAM, NUM_BUCKETS,
  2  to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics
  4  where table_name='T1'
  5  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 18:41:12

Notice that in both cases only the PK column has stats on it; columns A, B, C, D and E do not have any stats collected on them. Also note that when no SIZE clause is specified, SIZE defaults to 75 buckets.
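A quick way to spot this situation on your own tables (a sketch added here for illustration, not part of the original test) is to list the columns that have no statistics at all:

select c.column_name
from   user_tab_columns c
where  c.table_name = 'T1'
and    not exists (
         select null
         from   user_tab_col_statistics s
         where  s.table_name  = c.table_name
         and    s.column_name = c.column_name
       );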

Now one might think that is no big deal, or perhaps not even realize it is happening because one never looks at the stats. Let’s see what cardinality estimates the Optimizer produces for a few scenarios.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  /

  COUNT(*)
----------
    500227

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  4df0g0r99zmba, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.24 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |  10000 |    500K|00:00:00.50 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"=1)

Notice the E-Rows estimate for T1. The Optimizer is estimating 10,000 rows when in reality there are 500,227. The estimate is off by more than an order of magnitude (50x). Normally, for a single-table equality predicate, the cardinality calculation would be:
number of rows in T1 * 1/NDV = 1,000,000 * 1/2 = 500,000
but in this case 10,000 is the estimate. Strangely enough (or not), 10,000 is exactly 0.01 (1%) of 1,000,000. Because there are no column stats for T1.A, the Optimizer is forced to make a guess, and that guess is 1%.

As you can see from the 10053 trace (below), since there are no statistics on the column, defaults are used. In this case they yield very poor cardinality estimates.

SINGLE TABLE ACCESS PATH
  -----------------------------------------
  BEGIN Single Table Cardinality Estimation
  -----------------------------------------
  Column (#2): A(NUMBER)  NO STATISTICS (using defaults)
    AvgLen: 13.00 NDV: 31250 Nulls: 0 Density: 3.2000e-05
  Table: T1  Alias: T1
    Card: Original: 1000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
  -----------------------------------------
  END   Single Table Cardinality Estimation
  -----------------------------------------
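For reference, the 10053 trace above can be produced by setting the event before hard parsing the statement (this command is standard, but was not shown in the original session):

alter session set events '10053 trace name context forever, level 1';
-- hard parse / execute the query of interest, then turn the event off:
alter session set events '10053 trace name context off';
-- the trace file is written to the directory pointed to by user_dump_dest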

Now that I’ve demonstrated how poor the cardinality estimation was with a single equality predicate, let’s see what two equality predicates give us for a cardinality estimate.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6  /

  COUNT(*)
----------
    124724

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  ctq8q59qdymw6, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.19 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    100 |    124K|00:00:00.25 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("A"=1 AND "B"=3))

Yikes. In this case the cardinality estimate is 100 when the actual number of rows is 124,724, a difference of over 3 orders of magnitude (over 1000x). Where did the 100 row estimate come from? In this case there are two equality predicates so the selectivity is calculated as 1% * 1% or 0.01 * 0.01 = 0.0001. 1,000,000 * 0.0001 = 100. Funny that. (The 1% is the default selectivity for an equality predicate w/o stats.)

Now let’s add a derived predicate as well and check the estimates.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6    and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3
 and d+e > 50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.22 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |      5 |    109K|00:00:00.33 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("A"=1 AND "B"=3 AND "D"+"E">50))

Doh! The cardinality estimate is now 5, but the actual number of rows being returned is 109,816. Not good at all. The Optimizer estimated 5 rows because it used a default selectivity of 1% (for A=1) * 1% (for B=3) * 5% (for D+E > 50) * 1,000,000 rows. Now can you see why column statistics are very important? All it takes is a few predicates and the cardinality estimation becomes very small, very fast. Now consider this:

  • What is likely to happen in a data warehouse where the queries are 5+ table joins and the fact table columns do not have indexes?
  • Would the Optimizer choose the correct driving table?
  • Would nested loops plans be chosen when they are really not appropriate?

Hopefully you can see where this is going. If you don’t, here is the all too common chain of events:

  • Non representative (or missing) statistics lead to
  • Poor cardinality estimates which leads to
  • Poor access path selection which leads to
  • Poor join method selection which leads to
  • Poor join order selection which leads to
  • Poor SQL execution times

Take 2: Using the Defaults

Now I’m going to recollect stats with a default METHOD_OPT and run through the 3 execution plans again:

SQL> begin
  2    dbms_stats.delete_table_stats(user,'t1');
  3
  4    dbms_stats.gather_table_stats(
  5      ownname => user ,
  6      tabname => 'T1' ,
  7      estimate_percent => 100 ,
  8      degree => 8,
  9      cascade => true);
 10  end;
 11  /

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, histogram, NUM_BUCKETS,
  2  to_char(LAST_ANALYZED,'yyyy-dd-mm hh24:mi:ss') LAST_ANALYZED
  3  from user_tab_col_statistics where table_name='T1'
  4  /

COLUMN_NAME NUM_DISTINCT HISTOGRAM       NUM_BUCKETS LAST_ANALYZED
----------- ------------ --------------- ----------- -------------------
PK               1000000 NONE                      1 2008-13-10 19:44:32
A                      2 FREQUENCY                 2 2008-13-10 19:44:32
B                      5 FREQUENCY                 5 2008-13-10 19:44:32
C                     10 FREQUENCY                10 2008-13-10 19:44:32
D                    100 NONE                      1 2008-13-10 19:44:32
E                    100 NONE                      1 2008-13-10 19:44:32

6 rows selected.
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5  /

  COUNT(*)
----------
    500227

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  4df0g0r99zmba, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.20 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    500K|    500K|00:00:00.50 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"=1)
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6  /

  COUNT(*)
----------
    124724

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  ctq8q59qdymw6, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.14 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    124K|    124K|00:00:00.25 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1))
SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6    and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 0
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3
 and d+e>50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.17 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |   6236 |    109K|00:00:00.22 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1 AND "D"+"E">50))

As you can see, the first two queries have spot-on cardinality estimates, but the third query isn’t as good because it uses a column combination and there are no stats on the expression D+E, only on D and E individually. I’m going to rerun the third query with dynamic sampling set to level 4 (in 10g it defaults to 2) and reevaluate the cardinality estimate.

SQL> alter session set optimizer_dynamic_sampling=4;

Session altered.

SQL> select /*+ gather_plan_statistics */
  2    count(*)
  3  from t1
  4  where a=1
  5    and b=3
  6    and d+e > 50
  7  /

  COUNT(*)
----------
    109816

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------
SQL_ID  5x200q9rqvvfu, child number 1
-------------------------------------
select /*+ gather_plan_statistics */   count(*) from t1 where a=1   and b=3
 and d+e > 50

Plan hash value: 3724264953

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.17 |    3466 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    102K|    109K|00:00:00.22 |    3466 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("B"=3 AND "A"=1 AND "D"+"E">50))

Note
-----
   - dynamic sampling used for this statement

Bingo! Close enough to call statistically equivalent.
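If changing the session setting is not desirable, the same effect can be achieved for just this query with the dynamic_sampling hint; a sketch using level 4 to match the session setting above:

select /*+ gather_plan_statistics dynamic_sampling(t1 4) */
  count(*)
from t1
where a=1
  and b=3
  and d+e > 50
/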

Summary

I hope this little exercise demonstrates how important it is to have representative statistics and that when statistics are representative the Optimizer can very often accurately estimate the cardinality and thus choose the best plan for the query. Remember these points:

  • Recent statistics do not necessarily equate to representative statistics.
  • Statistics are required on all columns to yield good plans – not just indexed columns.
  • You probably should not be using METHOD_OPT => 'FOR ALL INDEXED COLUMNS SIZE AUTO', especially in a data warehouse where indexes are used sparingly.
  • Dynamic Sampling can assist with cardinality estimates where existing stats are not enough.

Tests performed on 10.2.0.4

Oracle 11g: Incremental Global Statistics On Partitioned Tables

Previously I blogged about the new and improved DBMS_STATS.AUTO_SAMPLE_SIZE used to calculate NDV in Oracle 11g and now I wanted to touch on another new feature of DBMS_STATS in 11g: Incremental Global Statistics On Partitioned Tables.

Before Incremental Global Stats (Two-Pass Method)

When DBMS_STATS.GATHER_TABLE_STATS collects statistics on a partitioned table, it generally does so at both the partition and table (global) level (the default behavior can be modified by changing the GRANULARITY parameter). This is done in two steps: first, partition-level stats are gathered by scanning the partition(s) that have stale or empty stats, then a full table scan is executed to gather the global statistics. As more partitions are added to a given table, the execution time for GATHER_TABLE_STATS grows, due to the full table scan required for the global stats.
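For illustration, the granularity can be controlled explicitly; a minimal sketch of gathering partition and global stats while skipping subpartitions (FOO is a placeholder table name):

begin
  dbms_stats.gather_table_stats(
    ownname     => user,
    tabname     => 'FOO',
    granularity => 'GLOBAL AND PARTITION'   -- partition + global stats, no subpartition stats
  );
end;
/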

Using Incremental Global Stats (Synopsis-Based Method)

Incremental Global Stats works by collecting stats on partitions and storing a synopsis, which is the statistics metadata for that partition and its columns. This synopsis is stored in the SYSAUX tablespace but is quite small (only a few kilobytes). Global stats are then created not by reading the entire table, but by aggregating the synopses from each partition. Incremental Global Stats, in conjunction with the new 11g DBMS_STATS.AUTO_SAMPLE_SIZE, yields a significant reduction in the time to collect statistics with near-perfect accuracy.

Turning On Incremental Global Stats

Incremental Global Stats can only be used for partitioned tables and is activated by this command:

SQL> exec DBMS_STATS.SET_TABLE_PREFS(user,'FOO','INCREMENTAL','TRUE')

-- To see the value for INCREMENTAL for a given table:

SQL> select dbms_stats.get_prefs('INCREMENTAL', tabname=>'FOO') from dual;

DBMS_STATS.GET_PREFS('INCREMENTAL',TABNAME=>'FOO')
--------------------------------------------------
TRUE

You may use any of the other DBMS_STATS.SET_*_PREFS procedures in the same way.
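For example, the same preference can be set schema-wide or database-wide (shown here as an illustration; the original demo only sets it per table):

-- default for every table in the current schema
exec DBMS_STATS.SET_SCHEMA_PREFS(user,'INCREMENTAL','TRUE')

-- database-wide default
exec DBMS_STATS.SET_GLOBAL_PREFS('INCREMENTAL','TRUE')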

A Real-World Example

To demonstrate the benefit of Incremental Global Statistics, I created a range partitioned table consisting of 60 partitions. The target table starts empty, one million (1,000,000) rows are inserted into a single partition, and then statistics are gathered. This is done 60 times, one partition at a time, emulating a daily ETL/ELT process over 60 days.
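Each iteration of the test looked roughly like the following; this is a sketch under my own assumptions (the table DDL and the actual load statement are not shown here), not the exact harness used:

begin
  -- step 1: load one day of data into the current partition (ETL/ELT step, details omitted)
  -- step 2: gather stats; with INCREMENTAL=TRUE only the freshly loaded partition is scanned
  --         and the global stats are derived from the stored synopses
  dbms_stats.gather_table_stats(
    ownname     => user,
    tabname     => 'CATALOG_SALES',
    granularity => 'AUTO'
  );
end;
/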

[Chart: elapsed stats-gathering time vs. number of partitions, Incremental=FALSE vs. Incremental=TRUE]

Elapsed Times

Partitions                Incremental=FALSE   Incremental=TRUE
         1                00:00:20.36         00:00:21.14
        10                00:02:27.25         00:00:37.76
        20                00:04:46.23         00:00:49.83
        30                00:07:05.47         00:01:01.80
        40                00:09:11.09         00:01:23.33
        50                00:11:33.18         00:01:30.40
        60                00:13:18.15         00:01:40.28
Cumulative Elapsed Time   06:42:21.20         01:00:53.80

As you can see from the chart and the table, without Incremental Global Stats the time to gather stats increases pretty much linearly with the number of partitions, but with Incremental Global Stats the elapsed time increases only slightly. The big difference is in the cumulative elapsed time: it takes 6 hours 42 minutes without Incremental Global Stats, but only about 1 hour with it. Quite a significant savings over time!

Revisiting The Math

For this experiment the time to gather stats without Incremental Global Stats is:
(time to scan & gather for 1 partition) + (time to scan and gather for entire table)
When Incremental Global Stats is used the time to gather stats is:
(time to scan & gather for 1 partition) + (time to aggregate all synopses)

The Diff Test

I exported the stats into a stats table and then ran the diff to compare the two runs. This will show us how comparable the two methods of stats gathering are.
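For reference, getting the stats into the user statistics tables referenced below is done with CREATE_STAT_TABLE and EXPORT_TABLE_STATS; a minimal sketch of what that might have looked like (the exact calls are my assumption):

-- after the Incremental=FALSE run
exec dbms_stats.create_stat_table(user,'STATS_DEFAULT')
exec dbms_stats.export_table_stats(user,'CATALOG_SALES',stattab=>'STATS_DEFAULT')

-- after the Incremental=TRUE run
exec dbms_stats.create_stat_table(user,'STATS_INC')
exec dbms_stats.export_table_stats(user,'CATALOG_SALES',stattab=>'STATS_INC')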

SQL> set long 500000 longchunksize 500000
SQL> select report, maxdiffpct from
     table(dbms_stats.diff_table_stats_in_stattab(user,'CATALOG_SALES','STATS_DEFAULT','STATS_INC'));

REPORT
------------------------------------------------------------------------------------
MAXDIFFPCT
----------
###############################################################################

STATISTICS DIFFERENCE REPORT FOR:
.................................

TABLE	      : CATALOG_SALES
OWNER	      : TPCDS
SOURCE A      : User statistics table STATS_DEFAULT
	      : Statid	   :
	      : Owner	   : TPCDS
SOURCE B      : User statistics table STATS_INC
	      : Statid	   :
	      : Owner	   : TPCDS
PCTTHRESHOLD  : 10
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


NO DIFFERENCE IN TABLE / (SUB)PARTITION STATISTICS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

COLUMN STATISTICS DIFFERENCE:
.............................

COLUMN_NAME	SRC NDV     DENSITY    HIST NULLS   LEN  MIN   MAX   SAMPSIZ
...............................................................................

CS_BILL_ADDR_SK A   1001152 .000000998 NO   148640  5	 C102  C402  5.9E+07
		B   1001176 .000000998 NO   148613  5	 C102  C402  5.9E+07
CS_BILL_CDEMO_S A   1868160 .000000535 NO   148646  6	 C102  C4025 5.9E+07
		B   1878320 .000000532 NO   148753  6	 C102  C4025 5.9E+07
CS_BILL_CUSTOME A   1942528 .000000514 NO   148104  6	 C102  C403  5.9E+07
		B   1949464 .000000512 NO   148192  6	 C102  C403  5.9E+07
CS_BILL_HDEMO_S A   7200    .000138888 NO   148227  4	 C102  C249  5.9E+07
		B   7200    .000138888 NO   148250  4	 C102  C249  5.9E+07
CS_CALL_CENTER_ A   30	    .033333333 NO   148310  3	 C102  C11F  5.9E+07
		B   30	    .033333333 NO   148272  3	 C102  C11F  5.9E+07
CS_CATALOG_PAGE A   11092   .000090155 NO   148111  5	 C102  C3023 5.9E+07
		B   11092   .000090155 NO   148154  5	 C102  C3023 5.9E+07
CS_EXT_LIST_PRI A   1133824 .000000881 NO   148461  6	 C102  C3036 5.9E+07
		B   1131680 .000000883 NO   148368  6	 C102  C3036 5.9E+07
CS_EXT_WHOLESAL A   394880  .000002532 NO   148842  5	 C102  C302  5.9E+07
		B   394880  .000002532 NO   148772  5	 C102  C302  5.9E+07
CS_ITEM_SK	A   205888  .000004857 NO   0	    5	 C102  C3152 5.9E+07
		B   205408  .000004868 NO   0	    5	 C102  C3152 5.9E+07
CS_LIST_PRICE	A   29896   .000033449 NO   148438  5	 C102  C204  5.9E+07
		B   29896   .000033449 NO   148458  5	 C102  C204  5.9E+07
CS_ORDER_NUMBER A   7151104 .000000139 NO   0	    6	 C102  C4102 5.9E+07
		B   7122072 .000000140 NO   0	    6	 C102  C4102 5.9E+07
CS_PROMO_SK	A   1000    .001       NO   148617  4	 C102  C20B  5.9E+07
		B   1000    .001       NO   148693  4	 C102  C20B  5.9E+07
CS_QUANTITY	A   100     .01        NO   148737  3	 C102  C202  5.9E+07
		B   100     .01        NO   148751  3	 C102  C202  5.9E+07
CS_SHIP_ADDR_SK A   1001088 .000000998 NO   148150  5	 C102  C402  5.9E+07
		B   1001152 .000000998 NO   148235  5	 C102  C402  5.9E+07
CS_SHIP_CDEMO_S A   1870592 .000000534 NO   148918  6	 C102  C4025 5.9E+07
		B   1878272 .000000532 NO   148862  6	 C102  C4025 5.9E+07
CS_SHIP_CUSTOME A   1938816 .000000515 NO   148300  6	 C102  C403  5.9E+07
		B   1948928 .000000513 NO   148309  6	 C102  C403  5.9E+07
CS_SHIP_DATE_SK A   1884    .000530785 NO   148674  6	 C4032 C4032 5.9E+07
		B   1884    .000530785 NO   148608  6	 C4032 C4032 5.9E+07
CS_SHIP_HDEMO_S A   7200    .000138888 NO   148172  4	 C102  C249  5.9E+07
		B   7200    .000138888 NO   148161  4	 C102  C249  5.9E+07
CS_SHIP_MODE_SK A   20	    .05        NO   148437  3	 C102  C115  5.9E+07
		B   20	    .05        NO   148486  3	 C102  C115  5.9E+07
CS_SOLD_DATE_SK A   1595    .000626959 NO   0	    6	 C4032 C4032 5.9E+07
		B   1587    .000630119 NO   0	    6	 C4032 C4032 5.9E+07
CS_WAREHOUSE_SK A   15	    .066666666 NO   148651  3	 C102  C110  5.9E+07
		B   15	    .066666666 NO   148620  3	 C102  C110  5.9E+07
CS_WHOLESALE_CO A   9901    .000100999 NO   149054  4	 C102  C202  5.9E+07
		B   9901    .000100999 NO   149099  4	 C102  C202  5.9E+07

The stats diff shows that for many columns the NDV is identical and for the others it is statistically equivalent (close enough to be the same). I will certainly be adding this feature to my “must use” list for Oracle 11g.

Further Reading

If you are interested in the bits and bytes of how the synopsis-based method works, I would suggest reading the whitepaper Efficient and Scalable Statistics Gathering for Large Databases in Oracle 11g, which was presented at SIGMOD 2008.