<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Structured Data &#187; Parallel Execution</title>
	<atom:link href="http://structureddata.org/category/oracle/parallel-execution/feed/" rel="self" type="application/rss+xml" />
	<link>http://structureddata.org</link>
	<description>Data, Databases, Performance &#38; Scalability</description>
	<lastBuildDate>Mon, 02 Apr 2012 05:30:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Counting Triangles Faster</title>
		<link>http://structureddata.org/2011/10/17/counting-triangles-faster/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=counting-triangles-faster</link>
		<comments>http://structureddata.org/2011/10/17/counting-triangles-faster/#comments</comments>
		<pubDate>Mon, 17 Oct 2011 21:05:49 +0000</pubDate>
		<dc:creator>Greg Rahn</dc:creator>
				<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallel Execution]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[graph analysis]]></category>
		<category><![CDATA[Vertica]]></category>

		<guid isPermaLink="false">http://structureddata.org/?p=1559</guid>
		<description><![CDATA[A few weeks back one of the Vertica developers put up a blog post on counting triangles in an undirected graph with reciprocal edges. The author was comparing the size of the data and the elapsed times to run this calculation on Hadoop and Vertica and put up the work on github and encouraged others: &#8220;do try this at home.&#8221; So I did. Compression Vertica draws attention to the fact that their compression brought the size of the 86,220,856 tuples down to 560MB in size, from a flat file size of 1,263,234,543 bytes resulting in around a 2.25X compression ratio. My first task was to load the data and see how Oracle&#8217;s Hybrid Columnar Compression would compare. Below is a graph of the sizes. As you can see, Oracle&#8217;s default HCC query compression (query high) compresses the data over 2X more than Vertica and even HCC query low compression beats out Vertica&#8217;s compression number. Query Elapsed Times The closest gear I had to Vertica&#8217;s hardware was an Exadata X2-2 system &#8212; both use 2 socket, 12 core Westmere-EP nodes. While one may try to argue that Exadata may somehow influence the execution times, I&#8217;ll point out that I was using [...]]]></description>
			<content:encoded><![CDATA[<p>A few weeks back one of the Vertica developers put up a <a href="http://www.vertica.com/2011/09/21/counting-triangles/" target="_blank">blog post on counting triangles</a> in an undirected graph with reciprocal edges.  The author was comparing the size of the data and the elapsed times to run this calculation on Hadoop and Vertica and put up the work on github and encouraged others: &#8220;do try this at home.&#8221;  So I did.</p>
<h3>Compression</h3>
<p>Vertica draws attention to the fact that their compression brought the size of the 86,220,856 tuples down to 560MB in size, from a flat file size of 1,263,234,543 bytes resulting in around a 2.25X compression ratio.  My first task was to load the data and see how Oracle&#8217;s Hybrid Columnar Compression would compare.  Below is a graph of the sizes.</p>
<p><a href="http://structureddata.org/wp-content/uploads/2011/10/compression.png"><img src="http://structureddata.org/wp-content/uploads/2011/10/compression.png" alt="" title="compression" width="1408" height="958" class="aligncenter size-full wp-image-1596" /></a></p>
<p>As you can see, Oracle&#8217;s default HCC query compression (query high) compresses the data over 2X more than Vertica and even HCC query low compression beats out Vertica&#8217;s compression number.  </p>
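<p>As a quick sanity check on the Vertica figures quoted above, the sketch below recomputes the compression ratio from the flat file size and the 560MB compressed size. The post does not say whether a megabyte means 10^6 or 2^20 bytes, so both readings are shown; the function name and that choice of two definitions are my own assumptions, not something taken from either vendor.</p>

```python
FLAT_FILE_BYTES = 1_263_234_543   # flat file size quoted in the post
COMPRESSED_MB = 560               # Vertica's compressed size quoted in the post


def compression_ratio(raw_bytes, compressed_mb, bytes_per_mb):
    """Ratio of raw size to compressed size for a given definition of a megabyte."""
    return raw_bytes / (compressed_mb * bytes_per_mb)


# Assumption: try both the decimal and the binary definition of a megabyte.
decimal_ratio = compression_ratio(FLAT_FILE_BYTES, COMPRESSED_MB, 10**6)
binary_ratio = compression_ratio(FLAT_FILE_BYTES, COMPRESSED_MB, 2**20)
print(f"decimal MB: {decimal_ratio:.2f}X, binary MB: {binary_ratio:.2f}X")
```

<p>The decimal reading lands closest to the "around 2.25X" figure, which suggests (but does not prove) that 560MB was meant as 560 million bytes.</p>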
<h3>Query Elapsed Times</h3>
<p>The closest gear I had to Vertica&#8217;s hardware was an <a href="http://www.oracle.com/technetwork/database/exadata/dbmachine-x2-2-datasheet-175280.pdf" target="_blank">Exadata X2-2</a> system &#8212; both use 2 socket, 12 core Westmere-EP nodes.  While one may try to argue that Exadata may somehow influence the execution times, I&#8217;ll point out that I was using <a href="http://download.oracle.com/docs/cd/E11882_01/server.112/e25554/px.htm#BCECBIDF">In-Memory Parallel Execution</a> so no table data was even read from spinning disk or Exadata Flash Cache &#8212; it&#8217;s all memory resident in the database nodes&#8217; buffer cache.  This seems to be in line with how Vertica executed their tests; though not explicitly stated, it&#8217;s a reasonable assumption.  </p>
<p>After I loaded the data and gathered table stats, I fired off the exact same SQL query that Vertica used to count triangles to see how Oracle would compare.  I ran the query on 1, 2 and 4 nodes just like Vertica.  Below is a graph of the results.</p>
<p><a href="http://structureddata.org/wp-content/uploads/2011/10/elapsed1.png"><img src="http://structureddata.org/wp-content/uploads/2011/10/elapsed1.png" alt="" title="elapsed1" width="1408" height="958" class="aligncenter size-full wp-image-1567" /></a></p>
<p>As you can see, the elapsed times are reasonably close but overall in Oracle&#8217;s favor: Oracle wins 2 of the 3 scale points and has a lower sum of the three executions (Vertica 519 seconds, Oracle 487 seconds), an advantage of 32 seconds for Oracle.</p>
<h3>It Should Go Faster!</h3>
<p>As a database performance engineer I was thinking to myself, &#8220;it really should go faster!&#8221;  I took a few minutes to look over things to see what could make this perform better.  You might think I was looking at parameters or something like that, but you would be wrong.  After a few minutes of looking at the query and the execution plan it became obvious to me &#8212; it could go faster!  I made a rather subtle change to the SQL query and reran my experiments.  With the modified SQL query Oracle was now executing twice as fast on 1 node as Vertica was on 4 nodes.  Also, on 4 nodes, the elapsed time came in at just 14 seconds, compared to the 97 seconds Vertica reported &#8212; a difference of almost 7X!  Below are the combined results.</p>
<p><a href="http://structureddata.org/wp-content/uploads/2011/10/elapsed2.png"><img src="http://structureddata.org/wp-content/uploads/2011/10/elapsed2.png" alt="" title="elapsed2" width="1408" height="958" class="aligncenter size-full wp-image-1568" /></a></p>
<h3>What&#8217;s The Go Fast Trick?</h3>
<p>I was thinking a bit more about the problem at hand &#8212; we need to count triangles but not count the same triangle more than once, since the edges are reciprocal.  Given that every edge exists in both directions, the query can be structured like Vertica wrote it &#8212; doing the filtering with a join predicate like <strong>e1.source &lt; e2.source</strong> to eliminate the duplicates &#8212; or we can simply use a single table filter predicate like <strong>source &lt; dest</strong> <em>before</em> the join takes place.  One of the first things they taught me in query planning and optimization class was to filter early!  That notion pays off big here because the early filter cuts the rows going into the first join as well as the output of the first join by a factor of 2 &#8212; 1.8 billion rows output vs. 3.6 billion.  That&#8217;s a huge savings not only in the first join, but in the second join as well.</p>
<p>Here is what my revised query looks like:</p>
<pre class="brush: sql; title: ; notranslate">
with
  e1 as (select * from edges where source &lt; dest),
  e2 as (select * from edges where source &lt; dest),
  e3 as (select * from edges where source &gt; dest)
select count(*)
from e1
join e2 on (e1.dest = e2.source)
join e3 on (e2.dest = e3.source)
where e3.dest = e1.source
</pre>
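<p>To make the "filter early" idea concrete outside the database, here is a minimal Python sketch that counts triangles on a tiny graph in two ways: once with the ordering checks applied as join predicates, and once with the rows filtered before any joining, mirroring the e1/e2/e3 query above. The function names and the four-vertex test graph are my own illustration; the "late" variant is only my reading of the join-predicate approach, not a copy of Vertica's SQL.</p>

```python
from collections import defaultdict
from operator import gt, lt


def with_reverse_edges(pairs):
    """Store every undirected edge in both directions (reciprocal edges)."""
    return sorted({(a, b) for a, b in pairs} | {(b, a) for a, b in pairs})


def index_by_source(edges):
    """Build a lookup from source vertex to its destinations (a hash join build side)."""
    index = defaultdict(list)
    for source, dest in edges:
        index[source].append(dest)
    return index


def count_filter_late(edges):
    """Apply the ordering checks as join predicates; e2 and e3 are unfiltered inputs."""
    by_source = index_by_source(edges)
    edge_set = set(edges)
    triangles = 0
    first_join_rows = 0
    for a, b in edges:                       # e1
        if not lt(a, b):                     # join predicate: e1.source before e2.source
            continue
        for c in by_source[b]:               # e2 joined on e1.dest = e2.source
            first_join_rows += 1
            if lt(b, c) and (c, a) in edge_set:
                triangles += 1
    return triangles, first_join_rows


def count_filter_early(edges):
    """Filter each input first, as in the e1/e2/e3 subqueries, then join."""
    e1 = [(s, d) for s, d in edges if lt(s, d)]
    e2 = index_by_source(e1)                 # same filter as e1
    e3 = {(s, d) for s, d in edges if gt(s, d)}
    triangles = 0
    first_join_rows = 0
    for a, b in e1:
        for c in e2[b]:                      # e1.dest = e2.source
            first_join_rows += 1
            if (c, a) in e3:                 # e2.dest = e3.source and e3.dest = e1.source
                triangles += 1
    return triangles, first_join_rows


# Hypothetical test graph: a complete graph on 4 vertices, which has 4 triangles.
edges = with_reverse_edges([(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)])
late = count_filter_late(edges)
early = count_filter_early(edges)
print("late  (triangles, first join rows):", late)
print("early (triangles, first join rows):", early)
```

<p>Both variants return the same triangle count, but the early filter produces fewer intermediate rows from the first join. The exact reduction depends on the graph, so the factor of 2 reported above applies to that data set rather than to this toy example.</p>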
<h3>Summary</h3>
<p>First, I&#8217;d like to thank the Vertica team for throwing the challenge out there and being kind enough to provide the data, code and their elapsed times.  I always enjoy a challenge &#8212; especially one that I can improve upon.  Now, I&#8217;m not going to throw any product marketing nonsense out there as that is certainly not my style (and there certainly is more than enough of that already), but rather I&#8217;ll just let the numbers do the talking.  I&#8217;d also like to point out that this experiment was done without any structure other than the table.  And in full disclosure, all of my SQL commands are available as well.</p>
<p>The other comment that I would make is that the new and improved execution times really make a mockery of the exercise when comparing to Hadoop MapReduce or Pig, but I would also mention that this test case is extremely favorable for parallel pipelined databases that can perform all operations in memory, and given the data set is so small, that is obviously the case here.  Overall, in my opinion, this is a poor problem choice for comparing the three technologies as it obviously (over) highlights the &#8220;right tool for the job&#8221; cliche.</p>
<p>Experiments performed on Oracle Database 11.2.0.2.</p>
<p><script src="https://gist.github.com/1289188.js"> </script></p>
]]></content:encoded>
			<wfw:commentRss>http://structureddata.org/2011/10/17/counting-triangles-faster/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Reading Parallel Execution Plans With Bloom Pruning And Composite Partitioning</title>
		<link>http://structureddata.org/2010/10/12/reading-parallel-execution-plans-with-bloom-pruning-and-composite-partitioning/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=reading-parallel-execution-plans-with-bloom-pruning-and-composite-partitioning</link>
		<comments>http://structureddata.org/2010/10/12/reading-parallel-execution-plans-with-bloom-pruning-and-composite-partitioning/#comments</comments>
		<pubDate>Tue, 12 Oct 2010 16:00:41 +0000</pubDate>
		<dc:creator>Greg Rahn</dc:creator>
				<category><![CDATA[Data Warehousing]]></category>
		<category><![CDATA[Execution Plans]]></category>
		<category><![CDATA[Optimizer]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallel Execution]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL Tuning]]></category>
		<category><![CDATA[Troubleshooting]]></category>
		<category><![CDATA[VLDB]]></category>
		<category><![CDATA[Bloom Filter]]></category>
		<category><![CDATA[Bloom Pruning]]></category>

		<guid isPermaLink="false">http://structureddata.org/?p=1189</guid>
		<description><![CDATA[You&#8217;ve probably heard sayings like &#8220;sometimes things aren&#8217;t always what they seem&#8221; and &#8220;people lie&#8221;. Well, sometimes execution plans lie. It&#8217;s not really by intent, but it is sometimes difficult (or impossible) to represent everything in a query execution tree in nice tabular format like dbms_xplan gives. One of the optimizations that was introduced back in 10gR2 was the use of bloom filters. Bloom filters can be used in two ways: 1) for filtering or 2) for partition pruning (bloom pruning) starting with 11g. Frequently the data models used in data warehousing are dimensional models (star or snowflake) and most Oracle warehouses use simple range (or interval) partitioning on the fact table date key column as that is the filter that yields the largest I/O reduction from partition pruning (most queries in a time series star schema include a time window, right!). As a result, it is imperative that the join between the date dimension and the fact table results in partition pruning. Let&#8217;s consider a basic two table join between a date dimension and a fact table. For these examples I&#8217;m using STORE_SALES and DATE_DIM which are TPC-DS tables (I frequently use TPC-DS for experiments as it uses a [...]]]></description>
			<content:encoded><![CDATA[<p>You&#8217;ve probably heard sayings like &#8220;sometimes things aren&#8217;t always what they seem&#8221; and &#8220;people lie&#8221;.  Well, sometimes execution plans lie.  It&#8217;s not really by intent, but it is sometimes difficult (or impossible) to represent everything in a query execution tree in a nice tabular format like the one dbms_xplan gives.</p>
<p>One of the optimizations that was introduced back in 10gR2 was the use of <a href="http://en.wikipedia.org/wiki/Bloom_filter">bloom filters</a>.  Bloom filters can be used in two ways: 1) for filtering or 2) for partition pruning (bloom pruning) starting with 11g.  Frequently the data models used in data warehousing are <a href="http://en.wikipedia.org/wiki/Dimensional_modeling">dimensional models</a> (star or snowflake) and most Oracle warehouses use simple range (or interval) partitioning on the fact table date key column as that is the filter that yields the largest I/O reduction from partition pruning (most queries in a time series star schema include a time window, right!).  As a result, it is imperative that the join between the date dimension and the fact table results in partition pruning.</p>
<p>Let&#8217;s consider a basic two table join between a date dimension and a fact table.  For these examples I&#8217;m using STORE_SALES and DATE_DIM which are <a href="http://www.tpc.org/tpcds/tpcds.asp">TPC-DS</a> tables (I frequently use TPC-DS for experiments as it uses a dimensional (star) model and has a data generator.) STORE_SALES contains a 5 year window of data ranging from 1998-01-02 to 2003-01-02.</p>
<h2>Range Partitioned STORE_SALES</h2>
<p>For this example I used range partitioning on STORE_SALES.SS_SOLD_DATE_SK using 60 one month partitions (plus 1 partition for NULL SS_SOLD_DATE_SK values) that align with the date dimension (DATE_DIM) on calendar month boundaries. STORE_SALES has the parallel attribute (PARALLEL 16 in this case) set on the table to enable Oracle&#8217;s Parallel Execution (PX).  Let&#8217;s look at the execution time and plan for our test query:</p>
<pre class="brush: sql; title: ; notranslate">
SQL&gt; select
  2    max(ss_sales_price)
  3  from
  4    store_sales ss,
  5    date_dim d
  6  where
  7    ss_sold_date_sk = d_date_sk and
  8    d_year = 2000
  9  ;

MAX(SS_SALES_PRICE)
-------------------
                200

Elapsed: 00:00:41.67

SQL&gt; select * from table(dbms_xplan.display_cursor(format=&gt;'basic +parallel +partition +predicate'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select   max(ss_sales_price) from   store_sales ss,   date_dim d where
 ss_sold_date_sk=d_date_sk and   d_year = 2000

Plan hash value: 934332680

--------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name        | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |             |       |       |        |      |            |
|   1 |  SORT AGGREGATE               |             |       |       |        |      |            |
|   2 |   PX COORDINATOR              |             |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10001    |       |       |  Q1,01 | P-&gt;S | QC (RAND)  |
|   4 |     SORT AGGREGATE            |             |       |       |  Q1,01 | PCWP |            |
|*  5 |      HASH JOIN                |             |       |       |  Q1,01 | PCWP |            |
|   6 |       BUFFER SORT             |             |       |       |  Q1,01 | PCWC |            |
|   7 |        PART JOIN FILTER CREATE| :BF0000     |       |       |  Q1,01 | PCWP |            |
|   8 |         PX RECEIVE            |             |       |       |  Q1,01 | PCWP |            |
|   9 |          PX SEND BROADCAST    | :TQ10000    |       |       |        | S-&gt;P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL   | DATE_DIM    |       |       |        |      |            |
|  11 |       PX BLOCK ITERATOR       |             |:BF0000|:BF0000|  Q1,01 | PCWC |            |
|* 12 |        TABLE ACCESS FULL      | STORE_SALES |:BF0000|:BF0000|  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access(&quot;SS_SOLD_DATE_SK&quot;=&quot;D_DATE_SK&quot;)
  10 - filter(&quot;D_YEAR&quot;=2000)
  12 - access(:Z&gt;=:Z AND :Z&lt;=:Z)
</pre>
<p>In this execution plan you can see the creation of the bloom filter on line 7 which is populated from the values of D_DATE_SK from DATE_DIM.  That bloom filter is then used to partition prune on the STORE_SALES table.  This is why we see :BF0000 in the Pstart/Pstop columns. </p>
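<p>For readers who have not worked with bloom filters, the following Python sketch shows the general idea behind bloom pruning: build a small probabilistic set from the dimension rows that pass the filter, then test each fact table partition against it. This is a toy model only. The class, the day-number surrogate keys, the 30-day partitions and the choice to store partition numbers in the filter are all assumptions made for illustration, and it makes no claim about how Oracle implements :BF0000 internally.</p>

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size=256, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical toy data: 1800 daily keys over 5 years of 360 days,
# with 60 monthly partitions of 30 days each.
date_dim = [(sk, 1998 + sk // 360) for sk in range(1800)]


def partition_for(date_sk):
    """Map a date key to its (hypothetical) monthly range partition number."""
    return date_sk // 30


# Build side: only dimension rows passing d_year = 2000 feed the filter.
bloom = BloomFilter()
needed = set()
for d_date_sk, d_year in date_dim:
    if d_year == 2000:
        part = partition_for(d_date_sk)
        needed.add(part)
        bloom.add(part)

# Probe side: scan a partition only if the filter says it might be needed.
kept = {p for p in range(60) if bloom.might_contain(p)}
print(f"partitions scanned: {len(kept)} of 60 (needed: {len(needed)})")
```

<p>Because a bloom filter can return false positives but never false negatives, the set of scanned partitions always includes every partition that is really needed and may occasionally include a few extra ones.</p>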
<h2>Range-Hash Composite Partitioned STORE_SALES</h2>
<p>For the next experiment, I kept the same range partitioning scheme but also added hash subpartitioning using the SS_ITEM_SK column (using 4 hash subpartitions per range partition).  STORE_SALES2 has 61 range partitions x 4 hash subpartitions for a total of 244 aggregate partitions.  Let&#8217;s look at the execution plan for our test query:</p>
<pre class="brush: sql; title: ; notranslate">
SQL&gt; select
  2    max(ss_sales_price)
  3  from
  4    store_sales2 ss,
  5    date_dim d
  6  where
  7    ss_sold_date_sk = d_date_sk and
  8    d_year = 2000
  9  ;

MAX(SS_SALES_PRICE)
-------------------
                200

Elapsed: 00:00:41.06

SQL&gt; select * from table(dbms_xplan.display_cursor(format=&gt;'basic +parallel +partition +predicate'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select   max(ss_sales_price) from   store_sales2 ss,   date_dim d where
  ss_sold_date_sk=d_date_sk and   d_year = 2000

Plan hash value: 2496395846

---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name         | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |              |       |       |        |      |            |
|   1 |  SORT AGGREGATE               |              |       |       |        |      |            |
|   2 |   PX COORDINATOR              |              |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10001     |       |       |  Q1,01 | P-&gt;S | QC (RAND)  |
|   4 |     SORT AGGREGATE            |              |       |       |  Q1,01 | PCWP |            |
|*  5 |      HASH JOIN                |              |       |       |  Q1,01 | PCWP |            |
|   6 |       BUFFER SORT             |              |       |       |  Q1,01 | PCWC |            |
|   7 |        PART JOIN FILTER CREATE| :BF0000      |       |       |  Q1,01 | PCWP |            |
|   8 |         PX RECEIVE            |              |       |       |  Q1,01 | PCWP |            |
|   9 |          PX SEND BROADCAST    | :TQ10000     |       |       |        | S-&gt;P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL   | DATE_DIM     |       |       |        |      |            |
|  11 |       PX BLOCK ITERATOR       |              |     1 |     4 |  Q1,01 | PCWC |            |
|* 12 |        TABLE ACCESS FULL      | STORE_SALES2 |     1 |   244 |  Q1,01 | PCWP |            |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access(&quot;SS_SOLD_DATE_SK&quot;=&quot;D_DATE_SK&quot;)
  10 - filter(&quot;D_YEAR&quot;=2000)
  12 - access(:Z&gt;=:Z AND :Z&lt;=:Z)
</pre>
<p>Once again you can see the creation of the bloom filter from DATE_DIM on line 7, however you will notice that we no longer see :BF0000 as our Pstart and Pstop values.  In fact, it may appear that partition pruning is not taking place at all as we see 1/244 as our Pstart/Pstop values.  However, if we compare the execution times between the range and range/hash queries you will note they are within a second of each other (41.67 vs. 41.06 seconds), thus there really is no way that partition (bloom) pruning is not taking place.  After all, if this plan read all 5 years of data it would take roughly 5 times as long as reading just 1 year and that certainly is not the case.  Would you have guessed that partition pruning is taking place had we not worked through the range only experiment first?  Hmmm&#8230;</p>
<h2>So What Is Going On?</h2>
<p>Before we dive in, let&#8217;s quickly look at what the execution plans would look like if PX was not used (using serial execution).</p>
<pre class="brush: sql; title: ; notranslate">
--
-- Range Partitioned, Serial Execution
--

---------------------------------------------------------------------
| Id  | Operation                     | Name        | Pstart| Pstop |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT              |             |       |       |
|   1 |  SORT AGGREGATE               |             |       |       |
|*  2 |   HASH JOIN                   |             |       |       |
|   3 |    PART JOIN FILTER CREATE    | :BF0000     |       |       |
|*  4 |     TABLE ACCESS FULL         | DATE_DIM    |       |       |
|   5 |    PARTITION RANGE JOIN-FILTER|             |:BF0000|:BF0000|
|   6 |     TABLE ACCESS FULL         | STORE_SALES |:BF0000|:BF0000|
---------------------------------------------------------------------

--
-- Range-Hash Composite Partitioned, Serial Execution
--

----------------------------------------------------------------------
| Id  | Operation                     | Name         | Pstart| Pstop |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT              |              |       |       |
|   1 |  SORT AGGREGATE               |              |       |       |
|*  2 |   HASH JOIN                   |              |       |       |
|   3 |    PART JOIN FILTER CREATE    | :BF0000      |       |       |
|*  4 |     TABLE ACCESS FULL         | DATE_DIM     |       |       |
|   5 |    PARTITION RANGE JOIN-FILTER|              |:BF0000|:BF0000|
|   6 |     PARTITION HASH ALL        |              |     1 |     4 |
|   7 |      TABLE ACCESS FULL        | STORE_SALES2 |     1 |   244 |
----------------------------------------------------------------------
</pre>
<p>When using composite partitioning, pruning is placed on one of the partition iterators. When the two nested partition iterators (range/hash in this case) are changed into a block iterator (line 11 &#8211; PX BLOCK ITERATOR), we have to pick a &#8220;victim&#8221; in the query plan tree since only one node in the plan now needs to carry the pruning information (with PX the pruning is really done by the QC, not the row source like in serial plans).  As a result, the information associated with the victimized partition iterator is lost in the explain plan.  This is why there is no :BF0000 for Pstart/Pstop in the plan in this case.  It would probably be more accurate for the parallel plans for both range and range/hash to look like this:</p>
<pre class="brush: sql; title: ; notranslate">
---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name         | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |              |       |       |        |      |            |
|   1 |  SORT AGGREGATE               |              |       |       |        |      |            |
|   2 |   PX COORDINATOR              |              |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10001     |       |       |  Q1,01 | P-&gt;S | QC (RAND)  |
|   4 |     SORT AGGREGATE            |              |       |       |  Q1,01 | PCWP |            |
|*  5 |      HASH JOIN                |              |       |       |  Q1,01 | PCWP |            |
|   6 |       BUFFER SORT             |              |       |       |  Q1,01 | PCWC |            |
|   7 |        PART JOIN FILTER CREATE| :BF0000      |       |       |  Q1,01 | PCWP |            |
|   8 |         PX RECEIVE            |              |       |       |  Q1,01 | PCWP |            |
|   9 |          PX SEND BROADCAST    | :TQ10000     |       |       |        | S-&gt;P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL   | DATE_DIM     |       |       |        |      |            |
|  11 |       PX BLOCK ITERATOR       |              |       |       |  Q1,01 | PCWC |            |
|* 12 |        TABLE ACCESS FULL      | STORE_SALES  |:BF0000|:BF0000|  Q1,01 | PCWP |            |
---------------------------------------------------------------------------------------------------
</pre>
<p>Here the bloom pruning is shown on the TABLE ACCESS FULL row source.  This is because there is no Pstart/Pstop for a PX BLOCK ITERATOR row source (it&#8217;s block ranges, so partition information is lost &#8211; it had been contained in the level above this).</p>
<p>Hopefully this helps you understand and correctly identify execution plans that contain bloom pruning even though at first glance you may not think they do.  If you are uncertain, use the execution stats for the query, looking at metrics like amount of data read and execution times, to provide some empirical insight.</p>
]]></content:encoded>
			<wfw:commentRss>http://structureddata.org/2010/10/12/reading-parallel-execution-plans-with-bloom-pruning-and-composite-partitioning/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>The Core Performance Fundamentals Of Oracle Data Warehousing &#8211; Parallel Execution</title>
		<link>http://structureddata.org/2010/04/19/the-core-performance-fundamentals-of-oracle-data-warehousing-parallel-execution/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-core-performance-fundamentals-of-oracle-data-warehousing-parallel-execution</link>
		<comments>http://structureddata.org/2010/04/19/the-core-performance-fundamentals-of-oracle-data-warehousing-parallel-execution/#comments</comments>
		<pubDate>Mon, 19 Apr 2010 15:00:25 +0000</pubDate>
		<dc:creator>Greg Rahn</dc:creator>
				<category><![CDATA[Data Warehousing]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallel Execution]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[VLDB]]></category>
		<category><![CDATA[parallel query]]></category>
		<category><![CDATA[scalability]]></category>

		<guid isPermaLink="false">http://structureddata.org/?p=818</guid>
		<description><![CDATA[[back to Introduction] Leveraging Oracle&#8217;s Parallel Execution (PX) in your Oracle data warehouse is probably the most important feature/technology one can use to speed up operations on large data sets. PX is not, however, &#8220;go fast&#8221; magic pixie dust for any old operation (if that&#8217;s what you think, you probably don&#8217;t understand the parallel computing paradigm). With Oracle PX, a large task is broken up into smaller parts, sub-tasks if you will, and each sub-task is then worked on in parallel. The goal of Oracle PX: divide and conquer. This allows a significant amount of hardware resources to be engaged in solving a single problem and is what allows the Oracle database to scale up and out when working with large data sets. I thought I&#8217;d touch on some basics and add my observations, but this is by no means an exhaustive write-up on Oracle&#8217;s Parallel Execution. There is an entire chapter in the Oracle Database documentation on PX as well as several white papers. I&#8217;ve listed all these in the References section at the bottom of this post. Read them, but as always, feel free to post questions/comments here. Discussion adds great value. A Basic Example of Parallel Execution [...]]]></description>
			<content:encoded><![CDATA[<p><em>[back to </em><a href="http://structureddata.org/2009/12/14/the-core-performance-fundamentals-of-oracle-data-warehousing-introduction/"><em>Introduction</em></a><em>]</em></p>
<p>Leveraging Oracle&#8217;s Parallel Execution (PX) in your Oracle data warehouse is probably the most important feature/technology one can use to speed up operations on large data sets.  PX is not, however, &#8220;go fast&#8221; magic pixie dust for any old operation (if that&#8217;s what you think, you probably don&#8217;t understand the parallel computing paradigm). With Oracle PX, a large task is broken up into smaller parts, sub-tasks if you will, and each sub-task is then worked on in parallel.  The goal of Oracle PX: divide and conquer.  This allows a significant amount of hardware resources to be engaged in solving a single problem and is what allows the Oracle database to scale up and out when working with large data sets.</p>
<p>I thought I&#8217;d touch on some basics and add my observations, but this is by no means an exhaustive write-up on Oracle&#8217;s Parallel Execution.  There is an entire chapter in the <a href="http://www.oracle.com/pls/db112/homepage">Oracle Database documentation</a> on PX as well as several white papers.  I&#8217;ve listed all these in the References section at the bottom of this post.  Read them, but as always, feel free to post questions/comments here.  Discussion adds great value.</p>
<p><strong>A Basic Example of Parallel Execution</strong></p>
<p>Consider a simple one table query like the one below.</p>
<p><img class="aligncenter size-full wp-image-857" title="cncpt017" src="http://structureddata.org/wp-content/uploads/2010/04/cncpt017.gif" alt="" width="659" height="239" /></p>
<p>You can see that the PX Coordinator (also known as the Query Coordinator or QC) breaks up the &#8220;work&#8221; into several chunks and those chunks are worked on by the PX Server Processes.  The chunk a PX Server Process works on is called a <em>granule</em>.  Granules can be either block-based or partition-based.</p>
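<p>The divide and conquer pattern described above can be sketched in a few lines of Python. This is only an analogy, assuming a table modeled as a list of blocks and a thread pool standing in for the PX Server Processes; the function names, the granule size and the sample data are all hypothetical, and Python threads do not give true CPU parallelism the way database server processes do.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def block_granules(total_blocks, granule_size):
    """Split block numbers 0..total_blocks-1 into contiguous (start, stop) ranges."""
    return [
        (start, min(start + granule_size, total_blocks))
        for start in range(0, total_blocks, granule_size)
    ]


def scan_granule(table, granule):
    """Work done by one server process: a partial aggregate over its block range."""
    start, stop = granule
    return max(max(block) for block in table[start:stop])


def parallel_max(table, granule_size, degree):
    """Coordinator role: hand out granules, then combine the partial results."""
    granules = block_granules(len(table), granule_size)
    with ThreadPoolExecutor(max_workers=degree) as pool:
        partials = list(pool.map(lambda g: scan_granule(table, g), granules))
    return max(partials), len(granules)


# Hypothetical table: 10 blocks of 4 values each.
table = [[(block * 7 + row) % 200 + 1 for row in range(4)] for block in range(10)]
result, granule_count = parallel_max(table, granule_size=3, degree=2)
print(f"max = {result}, computed from {granule_count} block-based granules")
```

<p>Note that there are more granules than workers here: each worker picks up another granule when it finishes one, which is what keeps the work evenly spread in this sketch.</p>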
<p><strong>When To Use Parallel Execution</strong></p>
<p>PX is a key component in data warehousing as that is where large data sets usually exist.  The most common operations that use PX are queries (SELECTs) and data loads (INSERTs or CTAS).  PX is most commonly controlled by using the PARALLEL attribute on the object, although it can also be controlled by hints or even Oracle&#8217;s Database Resource Manager.  If you are not using PX in your Oracle data warehouse then you are probably missing out on a shedload of performance opportunity.</p>
<p>When an object has its PARALLEL attribute set or the <a href="http://download.oracle.com/docs/cd/E11882_01/server.112/e10592/sql_elements006.htm#SQLRF50801">PARALLEL hint</a> is used, queries will leverage PX, but to leverage PX for DML operations (INSERT/DELETE/UPDATE) remember to alter your session by using the command:</p>
<pre>alter session [enable|force] parallel dml;</pre>

<p><strong>Do Not Fear Parallel Execution</strong></p>
<p>Since Oracle&#8217;s PX is designed to take advantage of multiple CPUs (or CPU cores) at a time, it can leverage significant hardware resources, if available.  From my experience talking with Oracle DBAs, the ability of PX to do this scares them.  This results in DBAs implementing a relatively small degree of parallelism (DOP) on a system that could possibly support a much higher level (based on the number of CPUs).  Oftentimes though, the system that PX is being run on is not a <a href="http://structureddata.org/2009/12/22/the-core-performance-fundamentals-of-oracle-data-warehousing-balanced-hardware-configuration/">balanced system</a> and frequently has much more CPU power than disk and channel bandwidth, so data movement from disk becomes the bottleneck well before the CPUs are busy.  This results in many statements like &#8220;Parallel Execution doesn&#8217;t work&#8221; or similar because the user/DBA isn&#8217;t observing a decrease in execution time with more parallelism.  Bottom line: if the hardware resources are not available, the software certainly cannot scale.</p>
<p>Just for giggles (and education), here is a snippet from <a href="http://linux.die.net/man/1/top">top(1)</a> from one node of an Oracle Database Machine running a single query (across all 8 database nodes) at DOP 256.</p>
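<p>The unbalanced system argument can be put into a back-of-the-envelope model: a scan can go no faster than the slower of the CPU side and the I/O side. The Python sketch below is a deliberately simplified illustration; the function and every number in it (data size, per-process scan rate, I/O bandwidth) are made up for the example and are not measurements from any real system.</p>

```python
def elapsed_seconds(data_mb, dop, cpu_mb_per_sec_per_process, io_mb_per_sec):
    """Estimate scan time when throughput is capped by the slower resource."""
    cpu_rate = dop * cpu_mb_per_sec_per_process
    effective_rate = min(cpu_rate, io_mb_per_sec)
    return data_mb / effective_rate


# Hypothetical system: 100,000 MB to scan, 200 MB/s per PX process,
# but only 1,600 MB/s of disk and channel bandwidth.
for dop in (4, 8, 16, 32):
    seconds = elapsed_seconds(100_000, dop, 200, 1_600)
    print(f"DOP {dop:2d}: {seconds:6.1f} seconds")
```

<p>In this model the elapsed time stops improving once the DOP is high enough to saturate the I/O bandwidth, which is the behavior that leads people to conclude that parallelism "doesn't work".</p>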
<pre style="text-align: left;">top - 20:46:44 up 5 days,  3:48,  1 user,  load average: 36.27, 37.41, 35.75
Tasks: 417 total,  43 running, 373 sleeping,   0 stopped,   1 zombie
Cpu(s): 95.6%us,  1.6%sy,  0.0%ni,  2.2%id,  0.0%wa,  0.2%hi,  0.4%si,  0.0%st
Mem:  74027752k total, 21876824k used, 52150928k free,   440692k buffers
Swap: 16771852k total,        0k used, 16771852k free, 13770844k cached

USER       PID  PR  NI  VIRT  SHR  RES S %CPU %MEM    TIME+  COMMAND
oracle   16132  16   0 16.4g 5.2g 5.4g R 63.8  7.6 709:33.02 ora_p011_orcl
oracle   16116  16   0 16.4g 4.9g 5.1g R 60.9  7.2 698:35.63 ora_p003_orcl
oracle   16226  15   0 16.4g 4.9g 5.1g R 59.9  7.2 702:01.01 ora_p028_orcl
oracle   16110  16   0 16.4g 4.9g 5.1g R 58.9  7.2 697:20.51 ora_p000_orcl
oracle   16122  15   0 16.3g 4.9g 5.0g R 56.9  7.0 694:54.61 ora_p006_orcl</pre>
<p>(Quite the TIME+&nbsp;column&nbsp;there, huh!)</p>
<p><strong>Summary</strong></p>
<p><strong><span style="font-weight: normal;">In this post I&#8217;ve been a bit light on the technicals of PX, but that is mostly because 1) this is a fundamentals post and 2) there is a ton more detail in the referenced documentation and I really don&#8217;t feel like republishing what already exists. Bottom line, Oracle Parallel Execution is a must for scaling performance in your Oracle data warehouse. Take the time to understand how to leverage it to maximize performance in your environment and feel free to start a discussion here if you have questions. </span></strong></p>
<p><strong>References</strong></p>
<ul>
<li><a href="http://download.oracle.com/docs/cd/E11882_01/server.112/e10713/process.htm#CNCPT220">Concepts: Parallel Execution</a></li>
<li><a href="http://download.oracle.com/docs/cd/E11882_01/server.112/e10837/parallel.htm">VLDB and Partitioning Guide: Using Parallel Execution </a></li>
<li><a href="http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/dbbi-tech-info-sca-090608.html">Parallelism and Scalability for Data Warehousing</a></li>
<li><a href="http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/twp-parallel-execution-fundamentals-133639.pdf">Oracle Database Parallel Execution Fundamentals in Oracle 11g Release 2</a></li>
<li><a href="http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/twp-bidw-parallel-execution-130766.pdf">Parallel Execution and Workload Management</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://structureddata.org/2010/04/19/the-core-performance-fundamentals-of-oracle-data-warehousing-parallel-execution/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Oracle Parallel Execution: Interconnect Myths And Misunderstandings</title>
		<link>http://structureddata.org/2009/07/06/oracle-parallel-execution-interconnect-myths-and-misunderstandings/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=oracle-parallel-execution-interconnect-myths-and-misunderstandings</link>
		<comments>http://structureddata.org/2009/07/06/oracle-parallel-execution-interconnect-myths-and-misunderstandings/#comments</comments>
		<pubDate>Tue, 07 Jul 2009 00:00:17 +0000</pubDate>
		<dc:creator>Greg Rahn</dc:creator>
				<category><![CDATA[Data Warehousing]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallel Execution]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[VLDB]]></category>
		<category><![CDATA[interconnect traffic]]></category>
		<category><![CDATA[parallel query]]></category>

		<guid isPermaLink="false">http://structureddata.org/?p=602</guid>
		<description><![CDATA[A number of weeks back I had come across a paper/presentation by Riyaj Shamsudeen entitled Battle of the Nodes: RAC Performance Myths (available here). As I was looking through it I saw one example that struck me as very odd (Myth #3 &#8211; Interconnect Performance) and I contacted him about it. After further review Riyaj commented that he had made a mistake in his analysis and offered up a new example. I thought I&#8217;d take the time to discuss this as parallel execution seems to be one of those areas where many misconceptions and misunderstandings exist. The Original Example I thought I&#8217;d quickly discuss why I questioned the initial example. The original query Riyaj cited is this one: As you can see this is a very simple single table aggregation without a group by. The reason that I questioned the validity of this example in the context of interconnect performance is that the parallel execution servers (parallel query slaves) will each return exactly one row from the aggregation and then send that single row to the query coordinator (QC) which will then perform the final aggregation. Given that, it would seem impossible that this query could cause any interconnect issues [...]]]></description>
			<content:encoded><![CDATA[<p>A number of weeks back I had come across a paper/presentation by <a href="http://orainternals.wordpress.com/">Riyaj Shamsudeen</a> entitled <em>Battle of the Nodes: RAC Performance Myths</em> (<a href="http://orainternals.wordpress.com/my-papers-and-presentations/">available here</a>).  As I was looking through it I saw one example that struck me as very odd  (Myth #3 &#8211; Interconnect Performance) and I contacted him about it.  <a href="http://orainternals.wordpress.com/2009/06/20/rac-parallel-query-and-udpsnoop/">After further review</a> Riyaj commented that he had made a mistake in his analysis and offered up a new example.  I thought I&#8217;d take the time to discuss this as parallel execution seems to be one of those areas where many misconceptions and misunderstandings exist.</p>
<p><strong>The Original Example</strong></p>
<p>I thought I&#8217;d quickly discuss why I questioned the initial example.  The original query Riyaj cited is this one:</p>
<pre class="brush: sql; title: ; notranslate">
select /*+ full(tl) parallel (tl,4) */
       avg (n1),
       max (n1),
       avg (n2),
       max (n2),
       max (v1)
from   t_large tl;
</pre>
<p>As you can see this is a very simple single table aggregation without a group by.  The reason that I questioned the validity of this example in the context of interconnect performance is that the parallel execution servers (parallel query slaves) will each return exactly one row from the aggregation and then send that single row to the query coordinator (QC) which will then perform the final aggregation.  Given that, it would seem impossible that this query could cause any interconnect issues at all.</p>
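<p>One way to check this on your own system is to look at the table queue statistics right after the query finishes. This is only a sketch: it assumes you run it in the same session as the parallel query, since V$PQ_TQSTAT reports on the last parallel statement executed by the current session.</p>
<pre class="brush: sql; title: ; notranslate">
-- run in the same session, immediately after the parallel query completes
select   dfo_number,
         tq_id,
         server_type,
         process,
         num_rows,
         bytes
from     v$pq_tqstat
order by dfo_number,
         tq_id,
         server_type desc,
         process;
</pre>
<p>For an aggregation without a group by, I would expect each Producer row to show a NUM_ROWS of 1 and a correspondingly tiny BYTES value, which is the whole point: there is almost nothing to send to the QC.</p>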
<p><strong> Riyaj&#8217;s Test Case #1</strong></p>
<p>Recognizing the original example was somehow flawed, Riyaj came up with a new example (I&#8217;ll reference as TC#1) which consisted of the following query:</p>
<pre class="brush: sql; title: ; notranslate">
select /*+ parallel (t1, 8,2) parallel (t2, 8, 2)  */
       min (t1.customer_trx_line_id + t2.customer_trx_line_id),
       max (t1.set_of_books_id + t2.set_of_books_id),
       avg (t1.set_of_books_id + t2.set_of_books_id),
       avg (t1.quantity_ordered + t2.quantity_ordered),
       max (t1.attribute_category),
       max (t2.attribute1),
       max (t1.attribute2)
from   (select *
        from   big_table
        where  rownum &lt;= 100000000) t1,
       (select *
        from   big_table
        where  rownum &lt;= 100000000) t2
where  t1.customer_trx_line_id = t2.customer_trx_line_id;
</pre>
<p>The execution plan for this query is:</p>
<pre>
----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |     1 |   249 |       |  2846K  (4)| 01:59:01 |        |      |            |
|   1 |  SORT AGGREGATE           |           |     1 |   249 |       |            |          |        |      |            |
|*  2 |   HASH JOIN               |           |   100M|    23G|   762M|  2846K  (4)| 01:59:01 |        |      |            |
|   3 |    VIEW                   |           |   100M|    10G|       |  1214K  (5)| 00:50:46 |        |      |            |
|*  4 |     COUNT STOPKEY         |           |       |       |       |            |          |        |      |            |
|   5 |      PX COORDINATOR       |           |       |       |       |            |          |        |      |            |
|   6 |       PX SEND QC (RANDOM) | :TQ10000  |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | P->S | QC (RAND)  |
|*  7 |        COUNT STOPKEY      |           |       |       |       |            |          |  Q1,00 | PCWC |            |
|   8 |         PX BLOCK ITERATOR |           |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| BIG_TABLE |   416M|  6749M|       |  1214K  (5)| 00:50:46 |  Q1,00 | PCWP |            |
|  10 |    VIEW                   |           |   100M|    12G|       |  1214K  (5)| 00:50:46 |        |      |            |
|* 11 |     COUNT STOPKEY         |           |       |       |       |            |          |        |      |            |
|  12 |      PX COORDINATOR       |           |       |       |       |            |          |        |      |            |
|  13 |       PX SEND QC (RANDOM) | :TQ20000  |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | P->S | QC (RAND)  |
|* 14 |        COUNT STOPKEY      |           |       |       |       |            |          |  Q2,00 | PCWC |            |
|  15 |         PX BLOCK ITERATOR |           |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWC |            |
|  16 |          TABLE ACCESS FULL| BIG_TABLE |   416M|    10G|       |  1214K  (5)| 00:50:46 |  Q2,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."n1"="T2"."n1")
   4 - filter(ROWNUM<=100000000)
   7 - filter(ROWNUM<=100000000)
  11 - filter(ROWNUM<=100000000)
  14 - filter(ROWNUM<=100000000)
</pre>
<p>
<p>This is a rather synthetic query but there are a few things that I would like to point out.  First, this query uses a parallel hint with 3 values representing table/degree/instances, however instances has been deprecated (see <a href="http://download.oracle.com/docs/cd/B10501_01/server.920/a96540/sql_elements7a.htm#8477">10.2 parallel hint documentation</a>).  In this case the DOP is calculated by degree * instances or 16, not DOP=8 involving 2 instances.   Note that the rownum filter is causing all the rows from the tables to be sent back to the QC for the COUNT STOPKEY operation thus causing the execution plan to serialize, denoted by the P->S in the IN-OUT column.</p>
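<p>If you want to confirm the DOP a statement actually received rather than infer it from the hint, a query along these lines against GV$PX_SESSION can be run from a second session while the statement is executing. Treat it as a sketch; the grouping is just one reasonable way to summarize the rows.</p>
<pre class="brush: sql; title: ; notranslate">
-- run from another session while the parallel statement is active
select   qcsid,
         inst_id,
         server_set,
         count (*)        px_servers,
         max (degree)     actual_dop,
         max (req_degree) requested_dop
from     gv$px_session
where    server_set is not null
group by qcsid,
         inst_id,
         server_set
order by qcsid,
         inst_id,
         server_set;
</pre>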
<p>Riyaj had enabled sql trace for the QC and the TKProf output is as follows:</p>
<pre class="brush: sql; title: ; notranslate">
Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=152 pr=701158 pw=701127 time=1510221226 us)
98976295   HASH JOIN  (cr=152 pr=701158 pw=701127 time=1244490336 us)
100000000    VIEW  (cr=76 pr=0 pw=0 time=200279054 us)
100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200279023 us)
100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=100270084 us)
      0       PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
      0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
      0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)
100000000    VIEW  (cr=76 pr=0 pw=0 time=300298770 us)
100000000     COUNT STOPKEY (cr=76 pr=0 pw=0 time=200298726 us)
100000000      PX COORDINATOR  (cr=76 pr=0 pw=0 time=200279954 us)
      0       PX SEND QC (RANDOM) :TQ20000 (cr=0 pr=0 pw=0 time=0 us)
      0        COUNT STOPKEY (cr=0 pr=0 pw=0 time=0 us)
      0         PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)
      0          TABLE ACCESS FULL BIG_TABLE_NAME_CHANGED_12 (cr=0 pr=0 pw=0 time=0 us)
</pre>
<p>Note that the Rows column contains zeros for many of the row sources because this trace is only for the QC, not the slaves, and thus only QC rows will show up in the trace file.  Something to be aware of if you decide to use sql trace with parallel execution.</p>
<p><strong>Off To The Lab: Myth Busting Or Bust!</strong></p>
<p>I wanted to take a query like TC#1 and run it in my own environment so I could do more monitoring of it.  Given the alleged myth had to do with interconnect traffic of cross-instance (inter-node) parallel execution, I wanted to be certain to gather the appropriate data.  I ran several tests using a similar query on a similar sized data set (by row count) as the initial example.  I ran all my experiments on an Oracle Real Application Clusters (RAC) 11.1.0.7 cluster consisting of eight nodes, each with two quad-core CPUs.  The interconnect is InfiniBand and the protocol used is RDS (Reliable Datagram Sockets).</p>
<p>Before I get into the experiments I think it is worth mentioning that Oracle's parallel execution (PX), which includes Parallel Query (PQ), PDML &#038; PDDL, can consume vast amounts of resources.  This is by design.  You see, the idea of Oracle PX is to dedicate a large amount of resources (processes) to a problem by breaking it up into many smaller pieces and then operate on those pieces in parallel.  Thus the more parallelism that is used to solve a problem, the more resources it will consume, assuming those resources are available.  That should be fairly obvious, but I think it is worth stating.</p>
<p>For my experiments I used a table that contains just over 468M rows.</p>
<p>Below is my version of TC#1.  The query is a self-join on a unique key and the table is range partitioned by DAY_KEY into 31 partitions.  Note that I create an AWR snapshot immediately before and after the query.</p>
<pre class="brush: sql; title: ; notranslate">
exec dbms_workload_repository.create_snapshot

select /* &amp;&amp;1 */
       /*+ parallel (t1, 16) parallel (t2, 16) */
       min (t1.bsns_unit_key + t2.bsns_unit_key),
       max (t1.day_key + t2.day_key),
       avg (t1.day_key + t2.day_key),
       max (t1.bsns_unit_typ_cd),
       max (t2.curr_ind),
       max (t1.load_dt)
from   dwb_rtl_trx t1,
       dwb_rtl_trx t2
where  t1.trx_nbr = t2.trx_nbr;

exec dbms_workload_repository.create_snapshot
</pre>
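<p>To find the snapshot IDs that bracket a run like this, something along the lines of the following sketch works. The one hour window is an arbitrary assumption; adjust it to cover your test.</p>
<pre class="brush: sql; title: ; notranslate">
-- list recent AWR snapshots for every instance
select   snap_id,
         instance_number,
         begin_interval_time,
         end_interval_time
from     dba_hist_snapshot
where    end_interval_time &gt; systimestamp - interval '1' hour
order by snap_id,
         instance_number;
</pre>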
<p><strong>Experiment Results Using Fixed DOP=16</strong></p>
<p>I ran my version of TC#1 across a varying number of nodes by using Oracle services (instance_groups and parallel_instance_group have been deprecated in 11g), but kept the DOP constant at 16 for all the tests.  Below is a table of the experiment results.</p>
<table border="1" align="center">
<tr>
<td>
<div align="center"><strong>Nodes</strong></div>
</td>
<td>
<div align="center"><strong>Elapsed Time</strong></div>
</td>
<td>
<div align="center"><strong>SQL Monitor Report</strong></div>
</td>
<td>
<div align="center"><strong>AWR Report</strong></div>
</td>
<td>
<div align="center"><strong>AWR SQL Report</strong></div>
</td>
</tr>
<tr>
<td>
<div align="center">1</div>
</td>
<td>
<div align="center">00:04:54.12</div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/1/a6r9zzu06tudh.htm" target="_blank">a6r9zzu06tudh.htm</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/1/awrrpt_1_910_911.html">awrrpt_1_910_911.html</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/1/awrsqlrpt_1_910_911.html">awrsqlrpt_1_910_911.html</a></div>
</td>
</tr>
<tr>
<td>
<div align="center">2</div>
</td>
<td>
<div align="center">00:03:55.35</div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/2/54patfpds4pp3.htm" target="_blank">54patfpds4pp3.htm</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/2/awrrpt_2_912_913.html">awrrpt_2_912_913.html</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/2/awrsqlrpt_2_912_913.html">awrsqlrpt_2_912_913.html</a></div>
</td>
</tr>
<tr>
<td>
<div align="center">4</div>
</td>
<td>
<div align="center">00:02:59.24</div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/4/dgyay259941s4.htm" target="_blank">dgyay259941s4.htm</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/4/awrrpt_4_914_915.html">awrrpt_4_914_915.html</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/4/awrsqlrpt_4_914_915.html">awrsqlrpt_4_914_915.html</a></div>
</td>
</tr>
<tr>
<td>
<div align="center">8</div>
</td>
<td>
<div align="center">00:02:14.39</div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/8/7b1a4ngy9q7kc.htm" target="_blank">7b1a4ngy9q7kc.htm</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/8/awrrpt_3_916_917.html">awrrpt_3_916_917.html</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range/8/awrsqlrpt_3_916_917.html">awrsqlrpt_3_916_917.html</a></div>
</td>
</tr>
</table>
<p>Seemingly contrary to what many people would probably guess, the execution times got better as more nodes participated in the query, even though the DOP was held constant throughout each of the tests.</p>
<p><strong>Measuring The Interconnect Traffic</strong></p>
<p>One of the new additions to the AWR report in 11g was the inclusion of interconnect traffic by client.  This section is near the bottom of the report and looks like this:<br />
<img src="http://structureddata.org/files/pq/InterconnectThroughputByClient.png" alt="Interconnect Throughput By Client" /><br />
This allows the PQ message traffic to be tracked, whereas in prior releases it was not.</p>
<p>Even though AWR contains the throughput numbers (as in megabytes per second) I thought it would be interesting to see how much data was being transferred, so I used the following query directly against the AWR data.  I added a filter predicate to return only rows where DIFF_RECEIVED_MB is at least 10, so the instances that were not part of the execution are filtered out, as well as the single instance execution.</p>
<pre class="brush: sql; title: ; notranslate">
break on snap_id skip 1
compute sum of DIFF_RECEIVED_MB on SNAP_ID
compute sum of DIFF_SENT_MB on SNAP_ID

select *
from   (select   snap_id,
                 instance_number,
                 round ((bytes_sent - lag (bytes_sent, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_sent_mb,
                 round ((bytes_received - lag (bytes_received, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_received_mb
        from     dba_hist_ic_client_stats
        where    name = 'ipq' and
                 snap_id between 910 and 917
        order by snap_id,
                 instance_number)
where  snap_id in (911, 913, 915, 917) and
       diff_received_mb &gt;= 10
/

SNAP_ID    INSTANCE_NUMBER DIFF_SENT_MB DIFF_RECEIVED_MB
---------- --------------- ------------ ----------------
       913               1        11604            10688
                         2        10690            11584
**********                 ------------ ----------------
sum                               22294            22272

       915               1         8353             8350
                         2         8133             8418
                         3         8396             8336
                         4         8514             8299
**********                 ------------ ----------------
sum                               33396            33403

       917               1         5033             4853
                         2         4758             4888
                         3         4956             4863
                         4         5029             4852
                         5         4892             4871
                         6         4745             4890
                         7         4753             4889
                         8         4821             4881
**********                 ------------ ----------------
sum                               38987            38987
</pre>
<p>As you can see from the data, the more nodes that were involved in the execution, the more interconnect traffic there was, however, the execution times were best with 8 nodes.</p>
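<p>A note on the query itself: the LAG calls order by instance and then snapshot, and rely on the outer filter to discard the rows where the previous row belongs to a different instance.  A variant that partitions the window by instance makes the per-instance delta explicit.  This is a sketch of an alternative formulation using the same view, columns and snapshot range as above, not the query that produced the output shown.</p>
<pre class="brush: sql; title: ; notranslate">
select   *
from     (select snap_id,
                 instance_number,
                 round ((bytes_sent - lag (bytes_sent, 1) over
                   (partition by instance_number order by snap_id)) / 1024 / 1024) diff_sent_mb,
                 round ((bytes_received - lag (bytes_received, 1) over
                   (partition by instance_number order by snap_id)) / 1024 / 1024) diff_received_mb
          from   dba_hist_ic_client_stats
          where  name = 'ipq' and
                 snap_id between 910 and 917)
where    snap_id in (911, 913, 915, 917) and
         diff_received_mb &gt;= 10
order by snap_id,
         instance_number;
</pre>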
<p><strong>Further Explanation Of Riyaj's Issue</strong></p>
<p>If you read Riyaj's post, you noticed that he observed worse elapsed times, not better as I did, when running on two nodes versus one.  How could this be?  It was noted in the comment thread of that post that the configuration was using Gig-E as the interconnect in a Solaris IPMP active-passive configuration.  This means the interconnect speed would be capped at roughly 125MB/s (1000Mbps), the wire speed of a single Gig-E link.  This is by all means an inadequate configuration for cross-instance parallel execution.</p>
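<p>To put a rough number on that, here is a back-of-the-envelope sketch.  It takes the 11,604MB that instance 1 sent in my own two node run (a different system and data set than Riyaj's, so this is an illustration and not an analysis of his test) and asks how long that volume alone would need on a single Gig-E link, taking 1000Mbps divided by 8 bits per byte and ignoring protocol overhead.</p>
<pre class="brush: sql; title: ; notranslate">
-- 1000 Mbps / 8 bits per byte = theoretical Gig-E MB/s
-- 11604 MB is instance 1's DIFF_SENT_MB from the two node run above
select 1000 / 8                   gige_mb_per_sec,
       round (11604 / (1000 / 8)) min_seconds_on_gige
from   dual;
</pre>
<p>That works out to roughly a minute and a half of pure wire time for one instance's outbound traffic, before any real-world overhead, which is a large fraction of the 00:03:55 the whole two node query took on InfiniBand.</p>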
<p>There is a whitepaper entitled <a href="http://www.oracle.com/technology/products/bi/db/11g/pdf/twp_bidw_parallel_execution_11gr1.pdf" class="broken_link"><em>Oracle SQL Parallel Execution</em></a> that discusses many of the aspects of Oracle's parallel execution.  I would highly recommend reading it.  This paper specifically mentions:</p>
<blockquote><p>If you use a relatively weak interconnect, relative to the I/O bandwidth from the server to the storage configuration, then you may be better of restricting parallel execution to a single node or to a limited number of nodes; inter-node parallel execution will not scale with an undersized interconnect.</p></blockquote>
<p>I would assert that this is precisely the root cause (insufficient interconnect bandwidth for cross-instance PX) behind the issues that Riyaj observed, thus making his execution slower on two nodes than one node.</p>
<p><strong>The Advantage Of Hash Partitioning/Subpartitioning And Full Partition-Wise Joins </strong></p>
<p>At the end of <a href="http://orainternals.wordpress.com/2009/06/20/rac-parallel-query-and-udpsnoop/#comment-328">my comment</a> on Riyaj's blog, I mentioned:</p>
<blockquote><p>If a DW frequently uses large table to large table joins, then hash partitioning or subpartitioning would yield added gains as partition-wise joins will be used.</p></blockquote>
<p>I thought that it would be both beneficial and educational to extend TC#1 and implement hash subpartitioning so that the impact could be measured on both query elapsed time and interconnect traffic.  In order for a full partition-wise join to take place, the table must be partitioned/subpartitioned on the join key column, so in this case I've hash subpartitioned on TRX_NBR.  See the <a href="http://download.oracle.com/docs/cd/B28359_01/server.111/b32024/part_avail.htm#sthref414">Oracle Documentation on Partition-Wise Joins</a> for a more detailed discussion on PWJ.</p>
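<p>For readers who have not used composite partitioning, here is a minimal sketch of what a range/hash layout looks like.  The table name, column list, data types, partition bounds and the subpartition count of 16 are all assumptions for illustration; this is not the actual definition of DWB_RTL_TRX.</p>
<pre class="brush: sql; title: ; notranslate">
-- hypothetical table: range partitioned by day, hash subpartitioned on the join key
create table dwb_rtl_trx_sketch (
  trx_nbr       number not null,
  day_key       number not null,
  bsns_unit_key number
)
partition by range (day_key)
subpartition by hash (trx_nbr) subpartitions 16
(
  partition p_day_01  values less than (2),
  partition p_day_02  values less than (3),
  partition p_day_max values less than (maxvalue)
);
</pre>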
<p><strong>Off To The Lab: Partition-Wise Joins</strong></p>
<p>I've run through the exact same test matrix with the new range/hash partitioning model and below are the results.</p>
<table border="1" align="center">
<tr>
<td>
<div align="center"><strong>Nodes</strong></div>
</td>
<td>
<div align="center"><strong>Elapsed Time</strong></div>
</td>
<td>
<div align="center"><strong>SQL Monitor Report</strong></div>
</td>
<td>
<div align="center"><strong>AWR Report</strong></div>
</td>
<td>
<div align="center"><strong>AWR SQL Report</strong></div>
</td>
</tr>
<tr>
<td>
<div align="center">1</div>
</td>
<td>
<div align="center">00:02:42.41</div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/1/arty65g64fmt7.htm" target="_blank">arty65g64fmt7.htm</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/1/awrrpt_1_1041_1042.html">awrrpt_1_1041_1042.html</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/1/awrsqlrpt_1_1041_1042.html">awrsqlrpt_1_1041_1042.html</a></div>
</td>
</tr>
<tr>
<td>
<div align="center">2</div>
</td>
<td>
<div align="center">00:01:37.29</div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/2/crqv3q3x9rtgt.htm" target="_blank">crqv3q3x9rtgt.htm</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/2/awrrpt_2_1043_1044.html">awrrpt_2_1043_1044.html</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/2/awrsqlrpt_2_1043_1044.html">awrsqlrpt_2_1043_1044.html</a></div>
</td>
</tr>
<tr>
<td>
<div align="center">4</div>
</td>
<td>
<div align="center">00:01:12.82</div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/4/5yv7yvjgjxugg.htm" target="_blank">5yv7yvjgjxugg.htm</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/4/awrrpt_4_1045_1046.html">awrrpt_4_1045_1046.html</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/4/awrsqlrpt_4_1045_1046.html">awrsqlrpt_4_1045_1046.html</a></div>
</td>
</tr>
<tr>
<td>
<div align="center">8</div>
</td>
<td>
<div align="center">00:01:05.04</div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/8/8dkv0z9wm9881.htm" target="_blank">8dkv0z9wm9881.htm</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/8/awrrpt_3_1047_1048.html">awrrpt_3_1047_1048.html</a></div>
</td>
<td>
<div align="center"><a href="http://structureddata.org/files/pq/range-hash/8/awrsqlrpt_3_1047_1048.html">awrsqlrpt_3_1047_1048.html</a></div>
</td>
</tr>
</table>
<p>As you can see by the elapsed times, the range/hash partitioning model with the full partition-wise join has decreased the overall execution time by around a factor of 2X compared to the range only partitioned version.</p>
<p>Now let's take a look at the interconnect traffic for the PX messages:</p>
<pre class="brush: sql; title: ; notranslate">
break on snap_id skip 1
compute sum of DIFF_RECEIVED_MB on SNAP_ID
compute sum of DIFF_SENT_MB on SNAP_ID

select *
from   (select   snap_id,
                 instance_number,
                 round ((bytes_sent - lag (bytes_sent, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_sent_mb,
                 round ((bytes_received - lag (bytes_received, 1) over
                   (order by instance_number, snap_id)) / 1024 / 1024) diff_received_mb
        from     dba_hist_ic_client_stats
        where    name = 'ipq' and
                 snap_id between 1041 and 1048
        order by snap_id,
                 instance_number)
where  snap_id in (1042,1044,1046,1048) and
       diff_received_mb &gt;= 10
/
no rows selected
</pre>
<p>Hmm.  No rows selected?!?  I had previously put in the predicate "DIFF_RECEIVED_MB >= 10" to filter out the nodes that were not participating in the parallel execution.  Let me remove that predicate and rerun the query.</p>
<pre class="brush: sql; title: ; notranslate">
   SNAP_ID INSTANCE_NUMBER DIFF_SENT_MB DIFF_RECEIVED_MB
---------- --------------- ------------ ----------------
      1042               1            8                6
                         2            2                3
                         3            2                3
                         4            2                3
                         5            2                3
                         6            2                3
                         7            2                3
                         8            2                3
**********                 ------------ ----------------
sum                                  22               27

      1044               1            7                7
                         2            3                2
                         3            2                2
                         4            2                2
                         5            2                2
                         6            2                2
                         7            2                2
                         8            2                2
**********                 ------------ ----------------
sum                                  22               21

      1046               1            1                2
                         2            1                2
                         3            1                2
                         4            3                1
                         5            1                1
                         6            1                1
                         7            1                1
                         8            1                1
**********                 ------------ ----------------
sum                                  10               11

      1048               1            6                5
                         2            1                2
                         3            3                2
                         4            1                2
                         5            1                2
                         6            1                2
                         7            1                2
                         8            1                2
**********                 ------------ ----------------
sum                                  15               19
</pre>
<p>Wow, there is almost no interconnect traffic at all.  Let me verify with the AWR report from the 8 node execution.</p>
<p><img src="http://structureddata.org/files/pq/range-hash/8node-pwj-ic-traffic.png"></p>
<p>The AWR report confirms that there is next to no interconnect traffic for the PWJ version of TC#1.  The reason for this is that since the table is hash subpartitioned on the join column, matching subpartitions can be joined to each other directly, minimizing the data sent between parallel execution servers.  If you look at the execution plan (see the AWR SQL Report) from the first set of experiments you will notice that the PQ distribution method for each of the tables is HASH, but in the range/hash version of TC#1 there is no redistribution at all for either of the two tables.  The full partition-wise join behaves logically the same way that a shared-nothing database would; each of the parallel execution servers works on its partition, which does not require data from any other partition because of the hash partitioning on the join column.  The main difference is that in a shared-nothing database the data is physically hash distributed amongst the nodes (each node contains a subset of all the data), whereas all nodes in an Oracle RAC database have access to all the data.</p>
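<p>If you want to check for this on your own tables without running the full query, a sketch like the following shows the plan.  The select list is trimmed down from my version of TC#1 purely to keep the example short.</p>
<pre class="brush: sql; title: ; notranslate">
explain plan for
select /*+ parallel (t1, 16) parallel (t2, 16) */
       max (t1.day_key + t2.day_key)
from   dwb_rtl_trx t1,
       dwb_rtl_trx t2
where  t1.trx_nbr = t2.trx_nbr;

-- look at the PQ Distrib column and the operations around the HASH JOIN
select * from table (dbms_xplan.display);
</pre>
<p>In the range only case you should see PX SEND HASH operations feeding the join.  With a full partition-wise join those typically disappear and a PX PARTITION HASH iterator sits above the HASH JOIN instead, though the exact plan shape depends on your version and partitioning.</p>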
<p><strong>Parting Thoughts</strong></p>
<p>Personally I see no myth about cross-instance (inter-node) parallel execution and interconnect traffic, but frequently I see misunderstandings and misconceptions.  As shown by the data in my experiment, TC#1 (w/o hash subpartitioning) running on eight nodes is more than 2X faster than running on one node using exactly the same DOP.  Interconnect traffic is not a bad thing as long as the interconnect is designed to support the workload.  Sizing the interconnect is really no different than sizing any other component of your cluster (memory/CPU/disk space/storage bandwidth).  If it is undersized, performance will suffer.  Depending on the number and speed of the host CPUs and the speed and bandwidth of the interconnect, your results may vary.</p>
<p>By hash subpartitioning the table, the interconnect traffic was all but eliminated and the query execution times were around 2X faster than the non-hash subpartitioned version of TC#1.  This is obviously a much more scalable solution and one of the main reasons to leverage hash (sub)partitioning in a data warehouse.</p>
]]></content:encoded>
			<wfw:commentRss>http://structureddata.org/2009/07/06/oracle-parallel-execution-interconnect-myths-and-misunderstandings/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
	</channel>
</rss>

