The Core Performance Fundamentals Of Oracle Data Warehousing – Set Processing vs Row Processing

[back to Introduction]

In over six years of doing data warehouse POCs and benchmarks for clients, there is one area that I frequently see as problematic: “batch jobs”.  Most of the time these “batch jobs” take the form of PL/SQL procedures and packages that perform some data load, transformation, processing or something similar.  The reason these are so problematic is that the developers have hard-coded “slow” into them.  I’m fairly certain these developers didn’t know they had done this when they coded their PL/SQL, but nonetheless it happened.

So How Did “Slow” Get Hard-Coded Into My PL/SQL?

Generally “slow” gets hard-coded into PL/SQL because the PL/SQL developer(s) took the business requirements and did a “literal translation” of each rule/requirement, one at a time, instead of looking at the “before picture” and the “after picture” and determining the most efficient way to make those data changes.  Many times this surfaces as cursor-based row-by-row processing, but it can also appear as PL/SQL simply running a series of often poorly thought-out SQL commands.
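To make that distinction concrete, here is a minimal sketch of the same requirement (“load yesterday’s new rows into the warehouse table”) written both ways.  The table and column names are purely illustrative, not taken from any client’s code:

-- the literal, row-by-row translation of the rule
begin
  for r in (select order_id, customer_id, amount
            from stg_orders
            where create_date = trunc(sysdate) - 1)
  loop
    insert into dw_orders (order_id, customer_id, amount)
    values (r.order_id, r.customer_id, r.amount);
  end loop;
  commit;
end;
/

-- the set-based equivalent: one statement, one pass over the data, parallelizable
insert /*+ append */ into dw_orders (order_id, customer_id, amount)
select order_id, customer_id, amount
from stg_orders
where create_date = trunc(sysdate) - 1;
commit;

The second form hands the entire unit of work to the SQL engine, which can do it in bulk (and in parallel), rather than issuing one INSERT per row from PL/SQL.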

Hard-Coded Slow Case Study

The following is based on a true story.  Only the names have been changed to protect the innocent.

Here is a pseudo code snippet based on a portion of some data processing I saw in a POC:

{truncate all intermediate tables}
insert into temp1 select * from t1 where create_date = yesterday;
insert into temp1 select * from t2 where create_date = yesterday;
insert into temp1 select * from t3 where create_date = yesterday;
insert into temp2 select * from temp1 where {some conditions};
insert into target_table select * from temp2;
for each of 20 columns
loop
  update target_table t
    set t.column_name =
      (select column_name
       from t4
       where t.id=t4.id )
    where t.column_name is null;
end loop
update target_table t set ({list of 50 columns}) = (select {50 columns} from t5 where t.id=t5.id);

I’m going to stop there, as any more of this will likely make you cry more than you already are.

I almost hesitate to ask the question, but isn’t it quite obvious what is broken about this processing?  Here are the major inefficiencies as I see them:

  • What is the point of inserting all the data into temp1, only then to filter some of it out when temp2 is populated?  If you haven’t heard the phrase “filter early” you have some homework to do.
  • Why publish into the target_table and then perform 20 single-column updates, followed by a single 50-column update?  Better question yet: Why perform any bulk updates at all?  Bulk updates (and deletes) are simply evil – avoid them at all costs.

As with many clients that come in and do an Exadata Database Machine POC, this client really wasn’t motivated to make any changes to their existing code; they just wanted to see how much performance the Exadata platform would give them.  Much to their happiness, the move reduced their processing time from over 2.5 days (a weekend job that started Friday PM but didn’t finish by Monday AM) down to 10 hours, a savings of over 2 days (some 50 hours).  Now, if the job failed, they would have time to re-run it before the business opened on Monday morning.  Heck, I guess if I got back 50 hours out of 60 I’d be excited too, if I were not a database performance engineer who knew there was even more performance left on the table, waiting to be exploited.

Feeling unsatisfied, I took it upon myself to demonstrate the significant payback that re-engineering can yield on the Exadata platform, and I coded up an entirely new set-based data flow in just a handful of SQL statements (no PL/SQL).  The result: processing an entire week’s worth of data (several hundreds of millions of rows) now took just 12 minutes.  That’s right: 7 days’ worth of events scrubbed, transformed, enriched and published in just 12 minutes.
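I won’t reproduce the client’s actual SQL, but the shape of the rewrite looks roughly like the sketch below, reusing the made-up names from the pseudo code above.  The three source scans, the filter, and all of the lookup-based updates collapse into one direct-path, parallel INSERT ... SELECT:

insert /*+ append */ into target_table ({list of columns})
select /*+ parallel */
       s.id,
       nvl(s.colA, t4.colA),   -- the 20 "fill in the blanks" columns via outer join,
       nvl(s.colB, t4.colB),   -- instead of 20 single-column UPDATEs after the load
       t5.colX, t5.colY        -- the 50 columns from t5, instead of a bulk UPDATE
from (select * from t1 where create_date = yesterday
      union all
      select * from t2 where create_date = yesterday
      union all
      select * from t3 where create_date = yesterday) s
left outer join t4 on (s.id = t4.id)
left outer join t5 on (s.id = t5.id)
where {some conditions};       -- filter early, before anything is written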

When I gently broke the news to this client that it was possible to load the week’s events in just 12 minutes, they were quite excited (to say the least).  In fact, one person even said (a bit out of turn), “well, that would mean that a single day’s events could be loaded in just a couple of minutes, and that would give a new level of freshness to the data, which would allow the business to make faster, better decisions due to the timeliness of the data.”  My response: “BINGO!”  This client had the epiphany of what is now possible with Exadata that previously was impossible.

It’s Not a Need, It’s a Want

I’m not going to give away my database engineer hat for a product marketing hat just yet (or probably ever), but this is the reality that exists.  IT shops started with small data sets and used small-data-set programming logic on their data, and that worked for some time.  The reason: inefficient processing on a small data set is only a little inefficient, but the same processing logic on a big data set is very inefficient.  This is why I have said before: in order to fully exploit the Oracle Exadata platform (or any current-day platform) some re-engineering may be required.  Do not be mistaken: I am not saying you need to re-engineer your applications for Exadata.  I am saying you will want to re-engineer your applications for Exadata, because those applications simply were not designed to leverage the massively parallel processing that Exadata makes possible.  It’s time to base design decisions on today’s technology, not on what was available when your application was designed.  Fast forward to now.

21 comments

  1. Peter Scott

    Another cause of row-by-row, slow-by-slow processing is the use of some ETL tools – just because a few lines on the GUI look efficient, it does not mean it is quick under the covers.  Worse still are the “enterprise” tools that insist on doing the heavy stuff in their own engine rather than using the power of the target (any target!) to do the heavy lifting.  Makes me weep sometimes.  Managements (in the generic sense) often have more of a problem dealing with having made a poor (or outlived) tool choice than with having used a programmer who coded poorly.

    I have long believed in the sets appeal.

  2. Karsten

    I suppose it is better to demonstrate that optimisation after the customer has already bought in to the new hardware.  Now that phenomenal speed-up is all attributed to the new hardware.
    I wonder what kind of performance they would see with an optimal ETL process on the old iron – perhaps they would not even consider Exadata.

  3. Graham Oakes

    A question I would be asking… If they improve their code do they really need to use Exadata? Day in, day out, I see people looking for new hardware as the solution to their performance issues before they look at tuning their code. Is this a sign of the times?

  4. Nigel Thomas

    Hmm, 10 hours down to 12 minutes (x50 improvement). Making the admittedly heroic assumption that they could have made similar improvements on their previous h/w, that would have reduced processing time from 2.5 days (60 hours) to 1.2 hours (72 mins). Or just over 10 minutes per day.

    I hope they needed Exadata for something else!

    However, the tendency to row-based processing is very common (as has been reported for as long as I can remember – since the mid 80s at least). Changes to requirements over time (either during development, or after initial deployment) also tend to get added in piecemeal rather than revisiting the entire flow.

  5. Pingback: The Core Performance Fundamentals Of Oracle Data Warehousing – Introduction
  6. Greg Rahn

    @Karsten

    While it is very likely that the process could be better engineered on the current platform, there were nowhere near the hardware resources to run it in the same manner as I did using the Exadata Database Machine.

  7. Greg Rahn

    @Graham Oakes

    I agree with you 110% that IT shops look to simply buy performance by purchasing new hardware (or even new software), and to some extent they can.  However, if you really want a big-bang factor one needs to assess why something is actually slow, and many times the reason is not the hardware or the DBMS; it’s because it was engineered to be slow.  So yes, I agree there was likely room for improvement on the current platform, but I don’t think they could have achieved anywhere near what was seen on the Exadata Database Machine.  There simply were not the hardware resources or the software smarts to do so.

    Quite a bit of this stems from being a “victim of your environment”.  If you are unable to do something a certain way, you find a “creative” way to work around that limitation.  Not to sound like too much of a marketeer, but Exadata lifts quite a few of those limitations and opens up opportunities for new ways to address problems.

  8. Greg Rahn

    @Nigel Thomas

    The “could they have made similar gains” question comes up frequently, and it is a very valid question, but in reality there really is no way that is possible.  Some measurable gains, sure, but nowhere near what they observed on the Exadata Database Machine.  Just for a start, the system would have to read and write data as fast (or in the same range), and there is likely close to two orders of magnitude difference there (100s of MB/s vs >20 GB/s).  I don’t remember all the details of the current platform, but it was very typical of what I have seen for an “existing system”: hardware that is 4-6 years old (or more), with 2 (or maybe 4, if you are lucky) HBAs plumbed to some three-letter storage vendor’s array that can’t even run the fibre at wire speed even when stuffed full of drives.  Now compare that to a system (a 1-rack Exadata Database Machine) that has a physical I/O rate of over 20 GB/s and software technologies that reduce the payload sent from the storage (Smart Scans) and further eliminate unnecessary I/O (Storage Indexes), in addition to increasing the logical I/O scan rate (Exadata Hybrid Columnar Compression), and you have quite a recipe for fast.

  9. Greg Rahn

    @Peter Scott

    Great point to bring up.  I’ve also seen way too much of the suck-and-blow (take data out of the DBMS, change it, then write it back) with ETL tools.  The best way is simply SQL pushdown: make the warehouse platform run all of this inside the database.  I think many of the ETL tool vendors are working on this now, or have a very recent version that has this functionality, but it certainly cannot come soon enough.  The issue has been around for some time.

  10. JulesLt

    It’s not just management, though – a lot of developers like to think of databases – sorry, persistence layers – as places where objects go to sleep – rather than places where data goes to work.

    (In fact, a lot of these people would complain about management imposing Oracle on them).

    I was trying to explain this to a non-programmer and I ended up using the same thing you are expressing here – that good database programming is about thinking in terms of sets, whereas OO thinking starts with the Object.

    It’s telling, for instance, how long it took Java to introduce the concept of typed collections, rather than bags of any old object.  Equally, one of the strengths of Ruby and Python, and of functional programming languages, is less that they are ‘dynamic’ than that they offer a simpler syntax for expressing that very common coding construct: ‘Do X to all items in this set’.

  11. Tim Hall

    Hi.

    Abuse of PL/SQL is the same as abuse of any programming language. Any programmer interacting with a database needs to be great at SQL first, then worry about their other skills, whether they are PL/SQL, Java, .NET etc.

    I teach a PL/SQL course for Oracle University and one section that always raises eyebrows is when I talk about improving PL/SQL performance by not using PL/SQL. If you are programming PL/SQL like it is Oracle 7.3, then there are lots of things you do in PL/SQL that you can now do in DML directly. People solve problems with the skills they are most comfortable with.
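    A typical example of that shift (the table names here are purely illustrative): work that would once have been a cursor loop issuing one UPDATE per row can now be a single MERGE.

    -- the Oracle 7.3-era approach: loop and update one row at a time
    begin
      for r in (select account_id, new_status from status_changes)
      loop
        update accounts
        set status = r.new_status
        where account_id = r.account_id;
      end loop;
      commit;
    end;
    /

    -- the same work as a single DML statement
    merge into accounts a
    using status_changes s
    on (a.account_id = s.account_id)
    when matched then
      update set a.status = s.new_status;
    commit;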

    In some cases, like the one shown in the original post, the solution is just plain bad, but in some cases the solution is correct for the era in which the load process was coded. If time and money had been available, it may have been refactored with each new DB version update.

    Karsten: I agree that spending money on Exadata when you have such a badly coded app seems a little silly. If you are going to have to do the refactoring, why not do it first, then decide if you actually need the extra horsepower. :)

    Peter: Don’t even get me started on tools that generate SQL. :)

    Cheers

    Tim…

  12. Greg Rahn

    @Tim Hall

    I’m adding this to my book of quotes:

    “I [Tim Hall] talk about improving PL/SQL performance by not using PL/SQL.”

    You and the others think, and rightfully so, that it may seem a bit extreme to pursue Exadata before considering re-engineering, but the rest of the picture hasn’t been filled in by me yet.  You see, query performance is a big part of this as well, and as you probably know, if you have a reasonable design and reasonable SQL, Oracle’s Parallel Execution (query) engine takes over and invokes the massive parallelism behind the scenes; there’s really no way to hard-code slow into queries.  This is a much different scenario than batch programs.  In this case the current system could not even compare to the query performance they saw with Exadata.  Some of the differences were so large I dare not mention them. ;)
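    To illustrate (the table and column names are hypothetical), a straightforward, well-written query needs nothing special to benefit; at most a parallel hint, and the Parallel Execution engine does the rest:

    select /*+ parallel(s, 16) */
           s.product_id,
           sum(s.sale_amount)              -- full-scan aggregation, spread across parallel slaves
    from   sales s
    where  s.sale_date >= date '2010-01-01'
    group by s.product_id;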

  13. Kumar

    Greg:
    “Bulk updates (and deletes) are simply evil – avoid them at all costs.”
    Looking at the example you had given above, did you mean correlated updates and deletes, and not bulk fetching and processing in PL/SQL?

    Thank you
    Kumar

  14. Greg Rahn

    @Kumar

    Bulk updates and deletes using the UPDATE and DELETE SQL commands. Using BULK COLLECT in PL/SQL is a completely different topic.

  15. Jakub Illner

    I am afraid that not even Exadata will help much with cursor loops in PL/SQL.  Exadata dramatically improves the I/O throughput but will not help much with single-threaded PL/SQL loop processing.  I have witnessed a case where Exadata was actually slightly slower than an existing system.  But once the code was rewritten to use set-based processing it just blazed away.

    Also, one of the reasons some developers do not use set-based processing is that it requires a combination of developer and DBA skills, such as partition exchange loading, index handling, etc., that they are not familiar with.  I think that tools like ODI can help a lot, since they can enforce the use of set-based processing via templates (knowledge modules in the case of ODI).
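    For anyone who has not seen it, here is a minimal sketch of partition exchange loading (the table and partition names are made up): load a standalone staging table, then swap it into the partitioned target as a near-instant dictionary operation.

    -- empty staging table with the same shape as the target
    create table sales_stage as
    select * from sales where 1 = 0;

    -- direct-path load of the new day's data
    insert /*+ append */ into sales_stage
    select * from daily_feed;
    commit;

    -- swap the loaded segment in for the (empty) daily partition
    alter table sales
      exchange partition p_20100607 with table sales_stage
      including indexes without validation;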

  16. Enrique Aviles

    Great story. I sent it to some of my colleagues to show them upgrading the hardware is just part of the performance equation.

    Question: Are you saying we should avoid bulk SQL updates and deletes, PL/SQL bulk collects, or both? If we do that, what else is there if not the dreaded row by row processing? Please clarify!

  17. Enrique Aviles

    @Greg Rahn

    The difference between SQL bulk updates/deletes and PL/SQL bulk collects is clear. The question is if you are recommending to avoid SQL bulk updates/deletes. If that’s the case, what do you recommend instead?

  18. Greg Rahn

    @Enrique Aviles

    Instead of bulk UPDATEs/DELETEs, use set theory to create the complement via CTAS (or similar).  That is, for a DELETE, recreate what you want to keep; for an UPDATE, recreate the table with the modifications applied.

    There are a few reasons for this:
    1) Compression – UPDATEs/DELETEs are more expensive with compression and they generally leave the data in a much less compressed format.
    2) Logging – UPDATE and DELETE are both logged operations, while a NOLOGGING CTAS or INSERT /*+ APPEND */ is not.

    As a result, it is often faster and better (from a compression point of view) to recreate the segment.
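    A minimal sketch of the DELETE case (the table and column names are hypothetical); the UPDATE case is the same idea, with the modified column expressions in the SELECT list:

    -- instead of: delete from orders where order_status = 'CANCELLED';
    create table orders_keep nologging parallel as
    select *
    from orders
    where order_status <> 'CANCELLED';   -- recreate only the rows you want to keep

    -- swap the segments, then drop the old table once you are comfortable
    alter table orders rename to orders_old;
    alter table orders_keep rename to orders;

    Grants, indexes and constraints have to be recreated on the new segment, but for large tables that is typically still far cheaper than a fully logged bulk UPDATE or DELETE.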

  19. Pingback: Log Buffer #196, A Carnival of the Vanities for DBAs | The Pythian Blog
  20. Tim Hall

    @Greg Rahn

    Don’t get me wrong, I have no doubt about the Exadata horsepower.  I sat and watched the “Terabyte Hour” session at OpenWorld (OK, it was an Exadata V1, but still very impressive).  I completely agree that for most people, the performance of the re-engineered app on their own kit is not comparable to the performance of the re-engineered app on Exadata.  Indeed, some of the decisions on how you re-engineer the app may be affected by your kit, so there may be situations where it is sensible to re-engineer after moving to Exadata.

    The point is, and you have already made it in your post, put a bad app on top of great hardware and you are still left with a bad app.  It seems like everyone knows this, but it gets conveniently ignored when making hardware purchasing decisions.

    Cheers

    Tim…

    PS. If you have an Exadata V2 box going spare I could make use of it at home… :)
