Troubleshooting Bad Execution Plans
November 21, 2007One of the most common performance issues DBAs encounter are bad execution plans. Many try to resolve bad executions plans by setting optimizer related parameters or even hidden underscore parameters. Some even try to decipher a long and complex 10053 trace in hopes to find an answer. While changing parameters or analyzing a 10053 trace might be useful for debugging at some point, I feel there is a much more simple way to start to troubleshoot bad execution plans.
Verify The Query Matches The Business Question
This seems like an obvious thing to do, but I’ve seen numerous cases where the SQL query does not match the business question being asked. Do a quick sanity check verifying things like: join columns, group by, subqueries, etc. The last thing you want to do is consume time trying to debug a bad plan for an improperly written SQL query. Frequently I’ve found that this is the case for many of those “I’ve never got it to run to completion” queries.
What Influences The Execution Plan
I think it’s important to understand what variables influence the Optimizer in order to focus the debugging effort. There are quite a number of variables, but frequently the cause of the problem ones are: (1) non-default optimizer parameters and (2) non-representative object/system statistics. Based on my observations I would say that the most abused Optimizer parameters are:
- OPTIMIZER_INDEX_CACHING
- OPTIMIZER_INDEX_COST_ADJ
- DB_FILE_MULTIBLOCK_READ_COUNT
Many see setting these as a solution to get the Optimizer to choose an index plan over a table scan plan, but this is problematic in several ways:
- This is a global change to a local problem
- Although it appears to solve one problem, it is unknown how many bad execution plans resulted from this change
- The root cause of why the index plan was not chosen is unknown, just that tweaking parameters gave the desired result
- Using non-default parameters makes it almost impossible to correctly and effectively troubleshoot the root cause
Object and system statistics can have a large influence on execution plans, but few actually take the time to sanity check them during triage. These statistics exist in views like:
- ALL_TAB_COL_STATISTICS
- ALL_PART_COL_STATISTICS
- ALL_INDEXES
- SYS.AUX_STATS$
Using GATHER_PLAN_STATISTICS With DBMS_XPLAN.DISPLAY_CURSOR
As a first step of triage, I would suggest executing the query with a GATHER_PLAN_STATISTICS hint followed by a call to DBMS_XPLAN.DISPLAY_CURSOR. The GATHER_PLAN_STATISTICS hint allows for the collection of extra metrics during the execution of the query. Specifically, it shows us the Optimizer’s estimated number of rows (E-Rows) and the actual number of rows (A-Rows) for each row source. If the estimates are vastly different from the actual, one probably needs to investigate why. For example: In the below plan, look at line 8. The Optimizer estimates 5,899 rows and the row source actually returns 5,479,000 rows. If the estimate is off by three orders of magnitude (1000), chances are the plan will be sub-optimal. Do note that with Nested Loop Joins you need to multiply the Starts column by the E-Rows column to get the A-Rows values (see line 10).
select /*+ gather_plan_statistics */ ... from ... ; select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
------------------------------------------------------------------------------------------ | Id | Operation | Name | Starts | E-Rows | A-Rows | ------------------------------------------------------------------------------------------ | 1 | SORT GROUP BY | | 1 | 1 | 1 | |* 2 | FILTER | | 1 | | 1728K | | 3 | NESTED LOOPS | | 1 | 1 | 1728K | |* 4 | HASH JOIN | | 1 | 1 | 1728K | | 5 | PARTITION LIST SINGLE | | 1 | 6844 | 3029 | |* 6 | INDEX RANGE SCAN | PROV_IX13 | 1 | 6844 | 3029 | | 7 | PARTITION LIST SINGLE | | 1 | 5899 | 5479K | |* 8 | TABLE ACCESS BY LOCAL INDEX ROWID | SERVICE | 1 | 5899 | 5479K | |* 9 | INDEX SKIP SCAN | SERVICE_IX8 | 1 | 4934 | 5479K | | 10 | PARTITION LIST SINGLE | | 1728K | 1 | 1728K | |* 11 | INDEX RANGE SCAN | CLAIM_IX7 | 1728K | 1 | 1728K | ------------------------------------------------------------------------------------------
Using The CARDINALITY Hint
Now that I’ve demonstrated how to compare the cardinality estimates to the actual number of rows, what are the debugging options? If one asserts that the Optimizer will choose the optimal plan if it can accurately estimate the number of rows, one can test using the not so well (un)documented CARDINALITY hint. The CARDINALITY hint tells the Optimizer how many rows are coming out of a row source. The hint is generally used like such:
select /*+ cardinality(a 100) */ * from dual a; -------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100 | 200 | 2 (0)| 00:00:01 | | 1 | TABLE ACCESS FULL| DUAL | 100 | 200 | 2 (0)| 00:00:01 | --------------------------------------------------------------------------
In this case I told the Optimizer that DUAL would return 100 rows (when in reality it returns 1 row) as seen in the Rows column from the autotrace output. The CARDINALITY hint is one tool one can use to give the Optimizer accurate information. I usually find this the best way to triage a bad plan as it is not a global change, it only effects a single execution of a statement in my session. If luck has it that using a CARDINALITY hint yields an optimal plan, one can move on to debugging where the cardinality is being miscalculated. Generally the bad cardinality is the result of non-representative table/column stats, but it also may be due to data correlation or other factors. This is where it pays off to know and understand the size and shape of the data. If the Optimizer still chooses a bad plan even with the correct cardinality estimates, it’s time to place a call to Oracle Support as more in-depth debugging is likely required.
Where Cardinality Can Go Wrong
There are several common scenarios that can lead to inaccurate cardinality estimates. Some of those on the list are:
- Data skew: Is the NDV inaccurate due to data skew and a poor dbms_stats sample?
- Data correlation: Are two or more predicates related to each other?
- Out-of-range values: Is the predicate within the range of known values?
- Use of functions in predicates: Is the 5% cardinality guess for functions accurate?
- Stats gathering strategies: Is your stats gathering strategy yielding representative stats?
Some possible solutions to these issues are:
- Data skew: Choose a sample size that yields accurate NDV. Use DBMS_STATS.AUTO_SAMPLE_SIZE in 11g.
- Data correlation: Use Extended Stats in 11g. If <= 10.2.0.3 use a CARDINALITY hint if possible.
- Out-of-range values: Gather or manually set the statistics.
- Use of functions in predicates: Use a CARDINALITY hint where possible.
- Stats gathering strategies: Use AUTO_SAMPLE_SIZE. Adjust only where necessary. Be mindful of tables with skewed data.
How To Best Work With Oracle Support
If you are unable to get to the root cause on your own, it is likely that you will be in contact with Oracle Support. To best assist the support analyst I would recommend you gather the following in addition to the query text:
- Output from the GATHER_PLAN_STATISTICS and DBMS_XPLAN.DISPLAY_CURSOR
- SQLTXPLAN output. See Metalink Note 215187.1
- 10053 trace output. See Metalink Note 225598.1
- DDL for all objects used (and dependencies) in the query. This is best gotten as a expdp (data pump) using CONTENT=METADATA_ONLY. This will also include the object statistics.
- Output from:
select pname, pval1 from sys.aux_stats$ where sname='SYSSTATS_MAIN'; - A copy of your init.ora
Having this data ready before you even make the call (or create the SR on-line) should give you a jump on getting a quick(er) resolution.
Summary
While this blog post is not meant to be a comprehensive troubleshooting guide for bad execution plans, I do hope that it does help point you in the right direction the next time you encounter one. Many of the Optimizer issues I’ve seen are due to incorrect cardinality estimates, quite often due to inaccurate NDV or the result of data correlation. I believe that if you use a systematic approach you will find that debugging bad execution plans may be as easy as just getting the cardinality estimate correct.
21 Responses to “Troubleshooting Bad Execution Plans”
Hi Greg,
Can you elaborate this statement:
“Do note that with Nested Loop Joins you need to multiply the Starts column by the E-Rows column to get the A-Rows values”
I can’t see how multiplying Starts with E-Rows creates any meaningful results?
As far as I know the E-rows are entirely unrelated to A-rows and Starts, the first being just optimizers estimation cardinality and last two being actual
measurements done during execution.
Or did you mean that for the inner rowsource in NL join the E-Rows means estimated rows per one lookup iteration while A-Rows is cumulative across
all lookup iterations?
Thanks.
By Tanel Poder on Nov 24, 2007
Tanel-
You are correct in that with NJ joins the value for the inner row source E-Rows represents the per iteration value. Thus one must multiply the number of rows for a single iteration times the number of iterations. The A-Rows value is the cumulative for the given operation.
By Greg Rahn on Nov 24, 2007
Hi sir,
Very nice.
Regards
Aman….
By Aman.... on Dec 14, 2007
I have the same reaction as Tanel. It seems that E-Rows must be coming from v$sql_plan.cardinality and is just an estimate where as A-ROWS and STARTS are actual execution statistics and come from v$sql_plan_statistics so it’s weird to think of starts and e-rows being interelated when they are apples and oranges - ie estimated and actual. I guess the way to look at it is that cardinality from v$sql_plan is the estimated per iteration value, with number of iterations unknown?
By Kyle on Jun 24, 2008
@Kyle
I can see why you and Tanel have asked about this. It’s two columns that are supposed to be comparable, but the “units” of each column can differ for different row sources. To top it off, it is even again different if a bitmap index is involved: then the “A-Rows” is actually not rows, but the number of bitmap fragments. Yikes!
By Greg Rahn on Jul 2, 2008
Hi Greg,
I tried to run the two statements mentioned above but could not get the desired output where it shows the “E-rows” and “A-rows”.
Please show me how to do this.
Thanks,
Patrick.
By Patrick Huynh on Jul 20, 2008
@Patrick
What version are you using? This only works in 10g and newer.
SQL> select /*+ gather_plan_statistics */ ename, job, dept.deptno, dname 2 from emp, dept 3 where emp.deptno = dept.deptno 4 and job = 'CLERK'; ENAME JOB DEPTNO DNAME ---------- --------- ---------- -------------- MILLER CLERK 10 ACCOUNTING ADAMS CLERK 20 RESEARCH SMITH CLERK 20 RESEARCH JAMES CLERK 30 SALES SQL> SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST')); PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------ SQL_ID ffbu0q9huc86n, child number 0 ------------------------------------- select /*+ gather_plan_statistics */ ename, job, dept.deptno, dname from emp, dept where emp.deptno = dept.deptno and job = 'CLERK' Plan hash value: 1123238657 --------------------------------------------------------------... | Id | Operation | Name | Starts | E-Rows | A-Rows |... --------------------------------------------------------------... |* 1 | HASH JOIN | | 1 | 3 | 4 |... |* 2 | TABLE ACCESS FULL| EMP | 1 | 3 | 4 |... | 3 | TABLE ACCESS FULL| DEPT | 1 | 4 | 4 |... --------------------------------------------------------------... Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("EMP"."DEPTNO"="DEPT"."DEPTNO") 2 - filter("JOB"='CLERK') 21 rows selected.By Greg Rahn on Jul 20, 2008
Hi Greg,
It’s working now. I don’t why i didn’t see it before.
Thanks,
Patrick
By Patrick Huynh on Jul 20, 2008
How do you make this work for parallel queries? I see A-Rows and A-Time as zero for all the step of PQ operations. Not sure if it’s even possible as there would be different procesess involved when PQO being used…
-Dharmendra
By Dharmendra on Jul 22, 2008
@Dharmendra
As I recall
ALLSTATS LASTdoes not work for PQ because it only uses the QC stats and not the slaves stats. What you need to do is cause a hard parse (put a unique comment string in the query) and useALLSTATSlike such:SQL> select /*+ run1 gather_plan_statistics full(a) */ count(*) 2 from FOO a where a.ORGANIZATION_ID='00D000000000062'; COUNT(*) ---------- 142571 SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS')); PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------- SQL_ID duuhzc5f842ff, child number 0 ------------------------------------- select /*+ run1 gather_plan_statistics full(a) */ count(*) from FOO a where a.ORGANIZATION_ID='00D000000000062' Plan hash value: 1756755430 -------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | -------------------------------------------------------------------------------------------------------- | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:04.70 | 7 | 0 | | 2 | PX COORDINATOR | | 1 | | 4 |00:00:04.70 | 7 | 0 | | 3 | PX SEND QC (RANDOM) | :TQ10000 | 0 | 1 | 0 |00:00:00.01 | 0 | 0 | | 4 | SORT AGGREGATE | | 4 | 1 | 4 |00:00:18.67 | 9979 | 9601 | | 5 | PX BLOCK ITERATOR | | 4 | 605 | 142K|00:00:17.80 | 9979 | 9601 | |* 6 | TABLE ACCESS FULL| FOO | 54 | 605 | 142K|00:00:13.78 | 9979 | 9601 | -------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 6 - access(:Z>=:Z AND :Z< =:Z) filter("A"."ORGANIZATION_ID"='00D000000000062') 24 rows selected.By Greg Rahn on Jul 23, 2008
can the cardinality hint be applied to things like INDEX FAST FULL SCAN?
I’ve tried applying it but the results aren’t what I’d expect.
Here’s my plan
STARTS E_ROWS A_ROWS RATIO operation ------- ---------- ---------- ---------- ------------------------------------------ SELECT STATEMENT ~ 1 1 1 1 SORT ~ AGGREGATE 1 1 1657 .0006 TABLE ACCESS ~ BY INDEX ROWID WB_JOB 1 1 46346 0 NESTED LOOPS ~ 1 1 23171 0 HASH JOIN ~ 1 1 1 1 INDEX ~ RANGE SCAN PS0PAY_CALENDAR 1 1925 23171 .0831 VIEW ~ VW_SQ_1 1 1925 23171 .0831 HASH ~ GROUP BY 1 1925 33020 .0583 NESTED LOOPS ~ 1 1 1 1 INDEX ~ RANGE SCAN PS0PAY_CALENDAR 1 2000 33020 .0606 INDEX ~ FAST FULL SCAN WB_JOB 23171 23171 23174 .9999 INDEX ~ RANGE SCAN WB_JOBI’ve tried cardinality hint to fix the discrepancy at the next to last line of the plan. For example
SELECT A.COMPANY , A.PAYGROUP from PS_PAY_CALENDAR A , WB_JOB B where A.RUN_ID = 'PD2' and A.PAY_CONFIRM_RUN = 'N' and B.COMPANY = A.COMPANY and B.PAYGROUP = A.PAYGROUP and B.EFFDT = (SELECT /*+ cardinality(F 33000) */ MAX(F.EFFDT) from WB_JOB F where F.EMPLID = B.EMPLID and F.EMPL_RCD# = B.EMPL_RCD# and F.EFFDT< = A.PAY_END_DT) /but whether I shift the cardinality up or down via the hint, the discrepancy in the plan becomes works (both with higher and lower cardinality hint values)
By Kyle Hailey on Sep 3, 2008
Actually I can get the cardinality on the next line to go up about a maximum of 7500 and then it abruptly drops off to 286
when taking the hint from /*+ cardinality(F 50000) */ to /*+ cardinality(F 150000) */
outside of the bounds (give or take a few 100) the cardinality drops off.
By Kyle Hailey on Sep 3, 2008
@Kyle
The CARDINALITY hint only tells the Optimizer how many rows to expect from a given row source, so it must be applied to a table, subquery, query block, table function, etc.
In your case I think you will have to use a query block alias because you want to tell the Optimizer the cardinality coming out of a join. There easiest way to do this is to use this command:
select * from table(dbms_xplan.display_cursor(null,null,'+alias'));to print the Query Block Name and Object Alias.
Here is a simple example with EMP and DEPT:
Now add the hint.
Hope this helps.
By Greg Rahn on Sep 3, 2008
very cool, but alias doesn’t seem to do much for me. Here is the plan with aliases (cool feature)
but the it doesn’t seem to affect the estimated cardinalities
By Kyle Hailey on Sep 3, 2008
forgot to give the plan with hints (this is for the last explain plan)
SELECT /*+ cardinality(SEL$11A0A147 1) cardinality(SEL$975CABD1 23171) */ A.COMPANY , A.PAYGROUP from PS_PAY_CALENDAR A , WB_JOB B where A.RUN_ID = 'PD2' and A.PAY_CONFIRM_RUN = 'N' and B.COMPANY = A.COMPANY and B.PAYGROUP = A.PAYGROUP and B.EFFDT = ( SELECT MAX(F.EFFDT) from WB_JOB F where F.EMPLID = B.EMPLID and F.EMPL_RCD# = B.EMPL_RCD# and F.EFFDT< = A.PAY_END_DT) )By Kyle Hailey on Sep 3, 2008
Just to be a little clearer, here is the standard output
There are only 2 query blocks:
1 - SEL$11A0A147
7 - SEL$975CABD1
Query block ID=1 should have a cardinality of 1657 but says 1 so I’ll put in the hint
cardinality(SEL$11A0A147 1657
Query block at ID=7 should have a cardinality of 23171 but is esitmated at 1925 so I add the hint
cardinality(SEL$975CABD1 23171)
but after adding the hint the query plan and statistics are exactly the same
By Kyle Hailey on Sep 3, 2008