Category: Oracle

Oracle OpenWorld 2011: The Oracle Real-World Performance Group

The summer is flying by, and in no time it will be October, which means Oracle OpenWorld 2011 should be on your radar. Once again the Oracle Real-World Performance Group will be hosting three sessions. For those unfamiliar with our presentations, you will get marketing-free, no-nonsense performance insight of the highest caliber from Oracle’s most elite database performance engineers: the kind of things hard-core Oracle people want to hear about. Hope to see you there!

Session ID: 13641 (Wednesday Oct. 5th, 10:00)
Session Title: Real-World Performance: The Forklift Upgrade
Session Abstract: Today the motivation to consolidate and migrate existing data and applications into the extreme-high-performance database environments of Oracle Exadata and Oracle Exalogic is being driven by a desire to reduce costs and deliver increased performance and service levels to users. The process is often called a forklift migration, because applications and data are simply picked up and lifted onto new platforms. In this session, Oracle’s Real-World Performance Group describes how best to maximize your investment and achieve world-class performance and discusses the challenges and compromises that need to be made in the process.

Session ID: 13643 (Tuesday Oct. 4th, 13:15)
Session Title: Real-World Performance: How Oracle Does It
Session Abstract: Oracle’s Real-World Performance Group has been achieving world-class performance for clients for more than 15 years. In this session, some of the senior engineers describe the techniques, philosophy, and tools they use to address performance challenges. The session is packed with demos and examples to convey the pragmatism required in real-world performance today.

Session ID: 13640 (Thursday Oct. 6th, 10:30)
Session Title: Real-World Performance Questions and Answers
Session Abstract: This is your chance to pose specific questions to Oracle’s Real-World Performance Group. All questions need to be in writing (yes, cards will be provided) and should relate to database performance challenges. In the past, the best questions have been related to specific topics of interest to the audience.

Data Science Fun with the OOW Mix Session Voting Data

Over the past few weeks, Oracle Mix opened the Oracle OpenWorld 2011 Suggest-a-Session to the general public, where anyone could submit or vote on a session. One limitation of the Oracle Mix site was that it was impossible to sort the sessions by votes, but that challenge was tackled by Roel Hartman with his blog post and APEX demo. After seeing the top sessions by votes, I found it very interesting that around half of the top 15 sessions were from the same author. That got me thinking… and that thinking turned into a little data hacking project. Now I admit it, I think data is very cool, and even cooler is extracting patterns and neat information from data.

Getting the Data

The Oracle Mix site is very “crawler friendly”: it has well-defined code and tags, which made extracting the data fairly painless. The basic process I used came down to this:

  1. Get the listing of all the session proposals. That was done by going to the Mix main proposal page and walking all 43 pages of submissions, scraping the direct URL to each session.
  2. Now that I had all of the session abstract URLs, grab each of those pages, all 424 of them.
  3. From each session page, extract the relevant bits: Session Name, Session Author, Total Vote Count, and most importantly, who voted for this session.

I did all of that with curl, wget and some basic regex as a “version 1”, but I was hoping to go back and try it again using some sexier technology like Beautiful Soup. That will have to be continued…
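As a rough illustration of step 3, here is a minimal Python sketch of regex-based extraction. The post does not show the actual Oracle Mix markup, so the HTML snippet, the tag and class names, and the `parse_session` helper are all invented for this example; real pages would need their own patterns.

```python
import re

# Hypothetical markup: the real Oracle Mix HTML is not shown in the post,
# so these tags, class names and values are made up for illustration.
SAMPLE_PAGE = """
<h1 class="title">Example Session Title</h1>
<span class="author">Example Author</span>
<span class="vote-count">42</span>
<ul class="voters">
  <li><a href="/users/1">Voter One</a></li>
  <li><a href="/users/2">Voter Two</a></li>
</ul>
"""

def parse_session(html):
    """Extract session name, author, vote count and voter names from one page."""
    title = re.search(r'<h1 class="title">(.*?)</h1>', html).group(1)
    author = re.search(r'<span class="author">(.*?)</span>', html).group(1)
    votes = int(re.search(r'<span class="vote-count">(\d+)</span>', html).group(1))
    voters = re.findall(r'<li><a href="/users/\d+">(.*?)</a></li>', html)
    return {
        "title": title,
        "session_author": author.lower(),  # lower-cased, as in the result tables
        "total_votes": votes,
        "voters": [v.lower() for v in voters],
    }

session = parse_session(SAMPLE_PAGE)
print(session)
```

Regex is brittle against markup changes, which is one reason an HTML parser such as Beautiful Soup is the usual next step.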

The Social Network Effect

With Oracle Mix Suggest-a-Session, people generally vote for a session for one of two reasons:

  1. They are genuinely interested in the session topic
  2. The session author asked them to vote because of their social relationship

What I think is interesting to know is just how much of the voting is done because of #2. After all, Oracle Mix is a social networking site so there certainly is some voting for that reason. In fact, one of the session authors, Yury Velikanov from Pythian, even blogged his story of rounding up votes. The data shows us this, but more on that in just a bit…

The (Unofficial) Data

I took some time to mingle around the data and found some very interesting things. Let’s just start with a few high-level points:

  • There were 424 sessions submitted from 252 different authors.
  • There were 10,125 votes from 2,447 unique voters.
  • The number of submissions ranged from 1 to 24 per author.

Here are some interesting tidbits I extracted from the data set (apologies for not making a cool visualization chart of all this – but I’ll make up for it later):

-- top 10 sessions by total votes:
+-------------+-----------------+--------------------------------------------------------------------------------+
| total_votes | session_author  | title                                                                          |
+-------------+-----------------+--------------------------------------------------------------------------------+
|         167 | tariq farooq    | Oracle RAC Interview Q/A Interactive Competition                               |
|         156 | tariq farooq    | Database Performance Tuning: Getting the best out of Oracle Enterprise Manager |
|         137 | tariq farooq    | Overview & Implementation of Clustering & High Availability with Oracle VM     |
|         130 | tariq farooq    | Migrate Your Online Oracle Database to RAC Using Streams and Data Pump         |
|         127 | tariq farooq    | 360 Degrees - Achieving High Availability for Enterprise Manager Grid Control  |
|         126 | yury velikanov  | Oracle11G SCAN: Sharing successful implementation experience                   |
|         124 | sandip nikumbha | Accelerated Interface Development Approach - Integration Framework             |
|         123 | tariq farooq    | Oracle VM: Overview, Architecture, Deployment Planning & Demo/Exercise         |
|         123 | sandip nikumbha | Remote SOA - Siebel Local Web Services Implementation                          |
|         119 | yury velikanov  | AWR Performance data mining                                                    |
+-------------+-----------------+--------------------------------------------------------------------------------+

-- top 10 voters (who placed the most votes)
+--------------------+--------------+
| voter_name         | votes_placed |
+--------------------+--------------+
| arup nanda         |           53 |
| tariq farooq       |           43 |
| connie cservenyak  |           36 |
| xiaohuan xue       |           36 |
| bruce elliott      |           36 |
| peter khoury       |           35 |
| yugant patra       |           35 |
| balamohan manickam |           35 |
| suresh kuna        |           34 |
| eddie awad         |           34 |
+--------------------+--------------+

-- top 10 voters by unique session authors (how many unique authors did they vote for?)
+--------------------+----------------+
| name               | unique_authors |
+--------------------+----------------+
| arup nanda         |             28 |
| paul guerin        |             24 |
| eddie awad         |             24 |
| bruce elliott      |             23 |
| xiaohuan xue       |             23 |
| connie cservenyak  |             23 |
| peter khoury       |             22 |
| wai ling ng        |             22 |
| yugant patra       |             22 |
| balamohan manickam |             22 |
+--------------------+----------------+

-- top 10 session authors by total votes received, number of sessions, avg votes per session
+---------------------+-------------+----------+-----------------------+
| session_author      | total_votes | sessions | avg_votes_per_session |
+---------------------+-------------+----------+-----------------------+
| tariq farooq        |        1057 |        8 |              132.1250 |
| yury velikanov      |         557 |        5 |              111.4000 |
| alex gorbachev      |         429 |        6 |               71.5000 |
| sandip nikumbha     |         360 |        3 |              120.0000 |
| syed jaffar hussain |         281 |        4 |               70.2500 |
| kristina troutman   |         233 |        5 |               46.6000 |
| russell tront       |         221 |        3 |               73.6667 |
| wendy chen          |         217 |        3 |               72.3333 |
| asif momen          |         184 |        2 |               92.0000 |
| alison coombe       |         183 |        5 |               36.6000 |
+---------------------+-------------+----------+-----------------------+
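The post does not show the queries behind these result sets. As a sketch only, aggregates of this shape can be computed as below, using Python's built-in sqlite3 as a stand-in database. The two-table schema (`sessions`, `votes`) and the toy rows are assumptions made for illustration, not the author's actual data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema and toy data, invented for this example.
create table sessions (session_id integer primary key, session_author text, title text);
create table votes (session_id integer, voter_name text);
insert into sessions values
  (1, 'author a', 'Session A1'),
  (2, 'author a', 'Session A2'),
  (3, 'author b', 'Session B1');
insert into votes values
  (1, 'v1'), (1, 'v2'), (1, 'v3'),
  (2, 'v1'), (2, 'v2'),
  (3, 'v3');
""")

# Top sessions by total votes.
top_sessions = conn.execute("""
select count(*) as total_votes, s.session_author, s.title
from votes v join sessions s on s.session_id = v.session_id
group by s.session_id, s.session_author, s.title
order by total_votes desc, s.title
limit 10
""").fetchall()

# Authors by total votes received, sessions, and average votes per session.
# Note: the inner join ignores sessions that received no votes at all.
top_authors = conn.execute("""
select s.session_author,
       count(*) as total_votes,
       count(distinct s.session_id) as sessions,
       1.0 * count(*) / count(distinct s.session_id) as avg_votes_per_session
from votes v join sessions s on s.session_id = v.session_id
group by s.session_author
order by total_votes desc
limit 10
""").fetchall()

print(top_sessions)
print(top_authors)
```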

Diving In Deeper

I could not help noticing that Tariq Farooq had the top 5 spots by total vote count. I would assert that is related to these two points:

  1. Tariq has some very interesting and appealing sessions
  2. Tariq has lots of friends who voted for his sessions

I have no doubt that there is some of both in the mix, but just how much influence on the votes is there from a person’s circle of friends? Or to put it another way: How many voters only voted for a single session author? Or even more interesting, how many people voted for every session for a single author, and voted for no other sessions? All good questions… with answers that reside in the data!

-- number of users who voted for exactly one author
+---------------------------+
| users_voting_for_1_author |
+---------------------------+
|                       828 |
+---------------------------+

-- number of voters who voted for every session by a given author
-- and total # of votes per voter is the same # as sessions by an author
+-------------------------------------------------+
| users_who_voted_for_every_session_of_an_author |
+-------------------------------------------------+
|                                             826 |
+-------------------------------------------------+

Wow – now that’s interesting! Of the 828 people who voted for only a single session author, just two did not vote for every one of that author’s sessions. That’s community for you!
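For readers who want to see how the two counts relate, here is a small sketch, again using sqlite3 with an invented schema and toy data. It assumes a voter can vote for a given session at most once, so a single-author voter has voted for every session when their vote count equals that author's session count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema and toy data, invented for this example.
create table sessions (session_id integer primary key, session_author text);
create table votes (session_id integer, voter_name text);
insert into sessions values (1, 'author a'), (2, 'author a'), (3, 'author b');
insert into votes values
  (1, 'v1'), (2, 'v1'),   -- v1: one author, every session
  (1, 'v2'),              -- v2: one author, but only one of two sessions
  (1, 'v3'), (3, 'v3'),   -- v3: two different authors
  (3, 'v4');              -- v4: one author, every session
""")

single_author_voters, voted_for_every_session = conn.execute("""
with per_voter as (
  select v.voter_name,
         count(distinct s.session_author) as authors,
         min(s.session_author)            as author,
         count(*)                         as votes_placed
  from votes v join sessions s on s.session_id = v.session_id
  group by v.voter_name
),
per_author as (
  select session_author, count(*) as sessions
  from sessions
  group by session_author
)
select count(*),
       sum(case when p.votes_placed = a.sessions then 1 else 0 end)
from per_voter p join per_author a on a.session_author = p.author
where p.authors = 1
""").fetchone()

print(single_author_voters, voted_for_every_session)
```

In the toy data three voters stick to a single author and two of those cover every session, mirroring the 828 versus 826 relationship above on a much smaller scale.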

Visualizing the Voting Graph

I was very interested to see what the Mix Voting Graph looked like, so I imported the voting data into Gephi and rendered a network graph. What I wanted was to identify the community structure of the voters. Gephi lets you do this by partitioning the graph into modularity classes so that the communities become visible. This process is similar to how the LinkedIn InMap breaks your professional network into different communities.

Here is what the Oracle Mix voting community looks like:
[Image: mix_voting_graph]

This is a great visualization of the communities and it accentuates the data from above – the voters who only voted for a single author. These can be seen as the small nodes on the outer part of the graph that have just a single edge between them and the session author’s node. Good examples of this are Yury Velikanov and Tariq Farooq. Also clearly visible is what I’d refer to as the “Pythian and friends” community centered around Alex Gorbachev and Yury Velikanov in the darker green color. There are also several other distinct communities and the color coding makes that visible.
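The post does not say which import format was used for Gephi. One common route is a CSV edge list with `Source`, `Target` and `Weight` columns, which Gephi's spreadsheet import recognizes. The sketch below builds such a list from (voter, author) pairs; the sample pairs are made up, and the weight is simply the number of votes a voter gave to an author.

```python
import csv
import io
from collections import Counter

# Toy (voter, session author) pairs, one per vote; invented for illustration.
votes = [
    ("v1", "author a"),
    ("v1", "author a"),
    ("v2", "author a"),
    ("v2", "author b"),
]

# Collapse repeated voter -> author pairs into one weighted edge.
edge_weights = Counter(votes)

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])
for (voter, author), weight in sorted(edge_weights.items()):
    writer.writerow([voter, author, weight])

edge_csv = buf.getvalue()
print(edge_csv)
```

Writing `edge_csv` to a file gives something that can be loaded as an edges table; dropping the `Weight` column yields the unweighted variant.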

Shouts Out

This is my first real data hacking attempt with web data and using some of the tools like Gephi for the graph analysis. One of my inspirations was Neil Kodner’s Hadoop World 2010 Tweet Analysis, so I need to give a big shout out to Neil for that as well as his help with Gephi. Thanks Neil!

And One Last Thing

So what are people’s sessions about that were submitted? This Wordle says quite a bit.

Source

If you wish to play on your own: https://github.com/grahn/oow-vote-hacking

Addendum

Here is another graph where the edges are weighted according to votes to an author (obviously related to number of sessions for that author).
[Image: mix_vote_graph_weighted_edges]

Real-World Performance Videos on YouTube – Data Warehousing

Here are some videos of a data warehouse demo that the Real-World Performance Group has been running for a while now and we thought it was time to put them on YouTube. Hope you find them informative.

Migrate a 1TB Data warehouse in 20 Minutes (Part 1)

Migrate a 1TB Data warehouse in 20 Minutes (Part 2)

Migrate a 1TB Data warehouse in 20 Minutes (Part 3)

Migrate a 1TB Data warehouse in 20 Minutes (Part 4)

Implicit Datatype Conversion + Histograms = Bad Execution Plan?

Earlier today I exchanged some tweets with @martinberx about some optimizer questions, and after he posted more information on the ORACLE-L list, I was able to reproduce what he was observing.

The issue:

DB: 11.2.0.2.0 – 64bit
I have a small query with a little error, which causes big troubles.
The relevant part of the query is
WHERE ….
AND inst_prod_type=003
AND setid='COM01'

but INST_PROD_TYPE is VARCHAR2.

this leads to filter[ (TO_NUMBER("INST_PROD_TYPE")=3 AND "SETID"='COM01') ]

based on this TO_NUMBER ( I guess!) the optimiser takes a fix selectivity of 1%.

Can someone tell me if this 1% is right? Jonathan Lewis “CBO Fundamentals” on page 133 is only talking about character expressions.

Unfortunately there are only 2 distinct values of INST_PROD_TYPE so this artificial [low] selectivity leads to my problem:
An INDEX SKIP SCAN on PS0RF_INST_PROD is choosen. (columns of PS0RF_INST_PROD: INST_PROD_TYPE, SETID, INST_PROD_ID )

After fixing the statement to
AND inst_prod_type='003'
another index is used and the statement performs as expected.

Now I have no problem, but want to find the optimizers decisions in my 10053 traces.

The Important Bits of Information

From Martin’s email we need to pay close attention to:

  • Predicate of "inst_prod_type=003" where INST_PROD_TYPE is VARCHAR2 (note there are no single quotes around 003)
  • Implicit datatype conversion in the predicate section of the explain plan – TO_NUMBER("INST_PROD_TYPE")=3
  • Only 2 distinct values of INST_PROD_TYPE

From this information I’ll construct the following test case:

create table foo (c1 varchar2(8));
insert into foo select '003' from dual connect by level <= 1000000;
insert into foo select '100' from dual connect by level <= 1000000;
commit;
exec dbms_stats.gather_table_stats(user,'foo');

Using the display_raw function (a user-defined helper, not an Oracle built-in), we’ll look at the column stats.

col low_val     for a8
col high_val    for a8
col data_type   for a9
col column_name for a11

select
   a.column_name,
   display_raw(a.low_value,b.data_type) as low_val,
   display_raw(a.high_value,b.data_type) as high_val,
   b.data_type,
   a.density,
   a.histogram,
   a.num_buckets
from
   user_tab_col_statistics a, user_tab_cols b
where
   a.table_name='FOO' and
   a.table_name=b.table_name and
   a.column_name=b.column_name
/

COLUMN_NAME LOW_VAL  HIGH_VAL DATA_TYPE    DENSITY HISTOGRAM       NUM_BUCKETS
----------- -------- -------- --------- ---------- --------------- -----------
C1          003      100      VARCHAR2          .5 NONE                      1

Take note of the lack of a histogram.

Now let’s see what the CBO estimates for a simple query without and with quotes around the literal (an implicit cast versus a correctly typed string literal).

SQL> explain plan for select count(*) from foo where c1=003;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1000K|  3906K|   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("C1")=003)

14 rows selected.

SQL> explain plan for select count(*) from foo where c1='003';

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   868   (2)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1000K|  3906K|   868   (2)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("C1"='003')

14 rows selected.

In both cases the estimated number of rows is spot on – 1 million rows. Now let’s regather stats. Because our queries used C1 in their predicates, the column becomes a candidate for a histogram. We can see this from sys.col_usage$.

select  oo.name owner,
        o.name table_name,
        c.name column_name,
        u.equality_preds,
        u.equijoin_preds,
        u.nonequijoin_preds,
        u.range_preds,
        u.like_preds,
        u.null_preds,
        u.timestamp
from    sys.col_usage$ u,
        sys.obj$ o,
        sys.user$ oo,
        sys.col$ c
where   o.obj#   = u.obj#
and     oo.user# = o.owner#
and     c.obj#   = u.obj#
and     c.col#   = u.intcol#
and     oo.name  = 'GRAHN'
and     o.name   = 'FOO'
/

OWNER TABLE_NAME COLUMN_NAME EQUALITY_PREDS EQUIJOIN_PREDS NONEQUIJOIN_PREDS RANGE_PREDS LIKE_PREDS NULL_PREDS TIMESTAMP
----- ---------- ----------- -------------- -------------- ----------------- ----------- ---------- ---------- -------------------
GRAHN FOO        C1                       1              0                 0           0          0          0 2011-06-08 22:29:59

Regather stats and re-check the column stats:

SQL> exec dbms_stats.gather_table_stats(user,'foo');

PL/SQL procedure successfully completed.

SQL> select
  2     a.column_name,
  3     display_raw(a.low_value,b.data_type) as low_val,
  4     display_raw(a.high_value,b.data_type) as high_val,
  5     b.data_type,
  6     a.density,
  7     a.histogram,
  8     a.num_buckets
  9  from
 10     user_tab_col_statistics a, user_tab_cols b
 11  where
 12     a.table_name='FOO' and
 13     a.table_name=b.table_name and
 14     a.column_name=b.column_name
 15  /

COLUMN_NAME LOW_VAL  HIGH_VAL DATA_TYPE    DENSITY HISTOGRAM       NUM_BUCKETS
----------- -------- -------- --------- ---------- --------------- -----------
C1          003      100      VARCHAR2  2.5192E-07 FREQUENCY                 2

Note the presence of a frequency histogram. Now let’s re-explain:

SQL> explain plan for select count(*) from foo where c1=003;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |     1 |     4 |   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("C1")=003)

SQL> explain plan for select count(*) from foo where c1='003';

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   868   (2)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1025K|  4006K|   868   (2)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("C1"='003')

And whammy! Note that the implicit cast [ filter(TO_NUMBER("C1")=003) ] now has an estimate of 1 row (when we know there are 1 million).
So what is going on here? Let’s dig into the optimizer trace for some insight:

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for FOO[FOO]
  Column (#1):
    NewDensity:0.243587, OldDensity:0.000000 BktCnt:5458, PopBktCnt:5458, PopValCnt:2, NDV:2
  Column (#1): C1(
    AvgLen: 4 NDV: 2 Nulls: 0 Density: 0.243587
    Histogram: Freq  #Bkts: 2  UncompBkts: 5458  EndPtVals: 2
  Using prorated density: 0.000000 of col #1 as selectvity of out-of-range/non-existent value pred
  Table: FOO  Alias: FOO
    Card: Original: 2000000.000000  Rounded: 1  Computed: 0.50  Non Adjusted: 0.50
  Access Path: TableScan
    Cost:  875.41  Resp: 875.41  Degree: 0
      Cost_io: 853.00  Cost_cpu: 622375564
      Resp_io: 853.00  Resp_cpu: 622375564
  Best:: AccessPath: TableScan
         Cost: 875.41  Degree: 1  Resp: 875.41  Card: 0.50  Bytes: 0

As you can see from the line

Using prorated density: 0.000000 of col #1 as selectvity of out-of-range/non-existent value pred

The presence of the histogram and the implicit conversion of TO_NUMBER("C1")=003 cause the CBO to use a density of 0 because it thinks it’s a non-existent value. The reason for this is that TO_NUMBER("C1")=003 is the same as TO_NUMBER("C1")=3, and for the histogram lookup the CBO effectively uses the character value '3'. That value is not present in the histogram; only '003' and '100' are.
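The mismatch can be illustrated outside the database with a few lines of Python. This is only a rough model of the behavior described above, not the CBO's actual algorithm: the dictionary stands in for the frequency histogram, the computed cardinality of 0.50 is copied from the trace rather than derived, and the floor of 1 is an assumption that matches the trace's Computed: 0.50 and Rounded: 1.

```python
# Stand-in for the frequency histogram on C1: value -> row count (from the test case).
histogram = {"003": 1_000_000, "100": 1_000_000}

# What the query really returns: TO_NUMBER(c1) = 3 matches every '003' row.
actual_rows = sum(n for value, n in histogram.items() if int(value) == 3)

# What a lookup by character value sees: the number 3 becomes '3', not '003'.
lookup_key = str(3)
in_histogram = lookup_key in histogram

# As a string, '3' also sorts above the high value '100', i.e. it is out of range.
beyond_high_value = lookup_key > max(histogram)

# Value taken from the trace ("Computed: 0.50"); the minimum of 1 row is an
# assumption consistent with the trace's "Rounded: 1".
computed_card = 0.50
rounded_card = max(1, round(computed_card))

print(actual_rows, in_histogram, beyond_high_value, rounded_card)
```

The same lookup with 100 succeeds, because `str(100)` is `'100'`, which is one of the histogram values.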

Dumb Luck?

So, what if the predicate contained a number that was not left padded with zeros, say 100, the other value we put in the table?

SQL> explain plan for select count(*) from foo where c1=100;

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1342139204

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   875   (3)| 00:00:11 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| FOO  |  1009K|  3944K|   875   (3)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("C1")=100)

While not exact, the CBO estimate is quite close to the 1 million rows with C1='100'.

Summary

It’s quite clear that Martin’s issue came down to the following:

  • implicit casting
  • presence of a histogram
  • zero-left-padded number/string

The combination of these created a scenario where the CBO thinks the value is out-of-range and uses a prorated density of 0, resulting in a cardinality estimate of 1 when there are many more rows than 1.

The moral of the story here is to always cast your predicates correctly. This includes explicit casts of date types as well – never rely on the NLS settings.

All tests performed on 11.2.0.2.

NoCOUG Interview

Since I’ve been on a blogging hiatus for the past few months (and hopefully that will change shortly), I thought I’d mention that the interview I did with the famous Gwen (Chen) Shapira has now been published in the May 2011 NoCOUG Journal. Hopefully you find it interesting and insightful. Feel free to leave me a comment with your thoughts.