This weekend begins Oracle’s OpenWorld and there is something off the beaten path going on — OakTable World. There is a pretty good lineup of speakers, including myself, who will be talking on subjects that probably would not qualify for OpenWorld sessions for various reasons. My talk is entitled “Beyond the Relational Database” and in it I’ll be talking about non-relational (Big Data) technologies and what they have to offer. Be sure to bring an open mind! Hope to see you there.
To make things easy here is an iCal file for my session.
[This post was originally published on 2012/02/29 and was hidden shortly thereafter. I'm un-hiding it as of 2012/05/30 with some minor edits.]
Many Oracle Database users like tools with GUI interfaces because they add features and functionality that are not easily available from the command line interfaces like SQL*Plus. One of the more popular tools from my experiences is Oracle SQL Developer in part because it’s a free tool from Oracle. Given SQL Developer’s current design (as of version 3.1.07.42), some issues frequently show up when using it with Oracle Databases with Parallel Execution. SQL Developer also contains a bug that exacerbates this issue as well.
The crux of the issue with SQL Developer (and possibly other similar tools) and Parallel Execution comes down to how the application uses cursors. By default, SQL Developer has the array fetch size set to 50. This means that for any cursor SQL Developer opens for scolling, it will fetch the first 50 rows and when you scroll to the bottom of those rows in a grid, it will fetch the next 50 rows and so on. The array size can be controlled by going into Properties and changing Database -> Advanced -> SQL Array Fetch Size which allows for a max setting of 500 rows. This is good in the sense that the JDBC application can fetch an array of rows with a single JDBC database call, however, using this approach with Parallel Execution, the PX servers used for this cursor will not be released until the cursor is canceled or the last row is fetched. Currently the only way to force reading until the end of cursor in SQL Developer is to issue a Control+End in the data grid. As a result, any action that uses Parallel Execution and has not fetched all the rows or is not canceled/closed, will squat those Parallel Execution resources and prevent them from being used by other users. If enough users have open cursors backed by Parallel Execution, then it is possible that it could consume all of the Parallel Execution servers and will result in Parallel Execution requests being forced to Serial Execution because resources are not available, even if the system is completely idle.
The SQL Developer Bug
When experimenting with SQL Developer for this blog post I also found and filed a bug (bug 13706186) because it leaks cursors when a user browses data in a table by expanding Tables (in the left pane), clicking on a table name and then the Data tab. Unfortunately this bug adds insult to injury if the table is decorated with a parallel degree attribute because the leaked cursors do not release the Parallel Execution servers until the session is closed, thus preventing other sessions from using them.
This bug is easily demonstrated using the SCOTT schema, but any schema or table will do as long as the table has more rows than the array fetch size. For my example, I’m using a copy of the EMP table, called EMP2, which contains 896 rows and was created using the following SQL:
The steps to demonstrate this issue are as follows:
- Set up the EMP2 table using the above script or equivalent.
- Use SQL Developer to connect to the SCOTT schema or equivalent.
- Expand Tables in the Browser in the left pane.
- Click EMP2.
- Click on the Data tab of EMP2.
- Check the open cursors.
- Close the EMP2 tab in the right pane.
- Check the open cursors.
- Goto Step #4 and repeat.
I’m going to repeat this process two times for a total of three open and close operations. I’ll use this query to show the open cursors for the Data grid query for the EMP2 table (adjust if necessary if you are not using my example):
If we look at the output (scott_emp2_cursors.txt below the EM graphic) from the query we’ll see that the first time the EMP2 Data tab is opened, it opens two identical cursors (sql_exec_id 16777216 & 16777217). After closing the EMP2 Data tab, 16777216 is still open. The second time the EMP2 Data tab is opened, two more identical cursors are opened (sql_exec_id 16777218 & 16777219). The third time two more cursors are opened (sql_exec_id 16777220 & 16777221). After closing the tabs we still see three cursors open (sql_exec_id 16777216, 16777218 & 16777220), each of which are squatting two PX servers.
This behavior can also be seen in 11g Enterprise Manager (or dbconsole) on the SQL Monitoring page by sorting the statements by time — notice the (leaked cusor) statements with the green spinning pinwheels after all tabs have been closed (all parallel statements are monitored by default).
By the way, the cursor leak applies for tables without a parallel degree setting as well, but has more significant impact if the table is parallel because PX servers are a shared resource.
Obviously the cursor leak is a SQL Developer bug and needs fixing, but in the interim, DBAs should be aware that this behavior can have a global impact because Parallel Execution servers are shared by all database users. Also, if SQL Developer users are running Parallel Queries and keep the results grid open but do not fetch all the rows by using the Control+End functionality, those Parallel Execution servers will be unavailable for other users to use and could negatively impact other users queries leveraging Parallel Execution.
Personally I’d like to see a few enhancements to SQL Developer to avoid these pitfalls:
- Disable Parallel Execution for Table Data browsing.
Browsing data in a table via a scrollable grid is a “small data” problem and does not require the “big data” resources of Parallel Execution. This could easily be done by adding a NOPARALLEL hint when SQL Developer builds the query string.
- Add a property with functionality to read all rows w/o requiring the Control+End command (always fetch until end of cursor) or until a max number or rows are read (or a max amount of memory is used for the result set), then close the cursor.
By fetching until end of cursor or fetching a max number of rows and closing the cursor, the client will release any Parallel Execution resources it may have used. Obviously fetching all rows could be a problem with large result sets and cause SQL Developer to run out of memory and crash, which would be a bad user experience, but not releasing PX resources can easily lead to many bad user experiences.
I’ve seen the issue of potentially large result sets dealt with in other JDBC based GUI tools that connect to parallel databases by the tool appending a “LIMIT X” clause to queries where the user can control the value for “X” in a property setting. To the best of my knowledge, no other parallel databases support cursors in the way that Oracle does (being able to fetch rows, pause, then resume fetching), so there is no issue there with squatting resources with them (once they start fetching they must continue until the last row is fetched or the statement is canceled). As of release 11.2, Oracle does not support the LIMIT clause but this functionality could be done in the client by using some some upper limit on “array fetch size” * “number of fetches” or wrapping queries with a “select * from ([query text]) where rownum <= X" or similiar.
There are some “clever” server-side ways to deal with this as well, such as adding a logon trigger that disables parallel query if the V$SESSION.PROGRAM name is “SQL Developer”, but a robust, clean client side solution is preferred by myself and likely other DBAs as well. It’s really just a simple matter of programming.
When using SQL Developer or similar tools, be aware of the potential to squat Parallel Execution resources if the client tool has open cursors. Educate your SQL Developer users on how they can play well with others in an Oracle Database using Parallel Execution by closing unneeded tabs with open cursors. Be aware of the impact of the cursor leak bug in SQL Developer 3.1.07.42 (and possibly previous releases) until it is fixed.
Personally I’d like to see an enhancement to deal with this behavior and I don’t think it would require much programming. It certainly would allow DBAs to feel more confident that SQL Developer is a tool that can be used on production systems and not result in any issues. What are your thoughts? Do my enhancement requests seem reasonable and warranted?
One of the easiest ways to understand something is to see a visualization. Looking at Active Session History (ASH) data is no exception and I’ll dive into how to do so with R and how I used R plots to visually present a problem and confirm a hypothesis. But first some background…
Frequently DBAs use the Automatic Workload Repository (AWR) as an entry point for troubleshooting performance problems and in this case the adventure started the same way. In the AWR report Top 5 Timed Foreground Events, the log file sync event was showing up as the #3 event. This needed deeper investigation as often times the cause for longer log file sync times is related to longer log file parallel write times.
Top 5 Timed Foreground Events ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Avg wait % DB Event Waits Time(s) (ms) time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- log file sync 3,155,253 9,197 3 6.4 Commit
Drilling into this a bit deeper the two log file events reported the following in the Foreground Wait Events and Background Wait Events sections of the AWR report:
Foreground Wait Events -> s - second, ms - millisecond - 1000th of a second -> Only events with Total Wait Time (s) >= .001 are shown -> ordered by wait time desc, waits desc (idle events last) -> %Timeouts: value of 0 indicates value was ordered by wait time desc, waits desc (idle events last) -> Only events with Total Wait Time (s) >= .001 are shown -> %Timeouts: value of 0 indicates value was < .5%. Value of null is truly 0 Avg %Time Total Wait wait Waits % bg Event Waits -outs Time (s) (ms) /txn time -------------------------- ------------ ----- ---------- ------- -------- ------ log file parallel write 659,157 0 526 1 0.2 18.2
It is generally always worth looking at the breakdown of these times, as they are averages. We really want to understand the entire histogram of these wait events. For that we can look at the Wait Event Histogram section of the AWR report as below.
Wait Event Histogram -> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000 -> % of Waits: value of .0 indicates value was % of Waits: column heading of <=1s is truly 1s is truly >=1024ms -> Ordered by Event (idle events last) % of Waits ----------------------------------------------- Total Event Waits <1ms <2ms <4ms <8ms <16ms <32ms 1s -------------------------- ------ ----- ----- ----- ----- ----- ----- ----- ----- log file parallel write 661.6K 84.7 9.7 4.7 .4 .1 .3 .0 log file sync 3138.K 14.0 42.8 30.4 7.9 2.3 2.6 .1 Wait Event Histogram Detail (64 msec to 2 sec) -> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000 -> Units for % of Total Waits: ms is milliseconds s is 1024 milliseconds (approximately 1 second) -> % of Total Waits: total waits for all wait classes, including Idle -> % of Total Waits: value of .0 indicates value was Ordered by Event (only non-idle events are displayed) % of Total Waits ----------------------------------------------- Waits 64ms Event to 2s <32ms <64ms <1/8s <1/4s <1/2s <1s =2s -------------------------- ----- ----- ----- ----- ----- ----- ----- ----- ----- log file parallel write 52 100.0 .0 .0 log file sync 3535 99.9 .1 .0
One thing that you should notice here is there are two sections of Wait Event Histogram; the buckets less than 32ms and buckets greater than 32ms. It is also important to note that only the absence of a value means no timings fell into that bucket — so even though the report shows .0 there are still events in that bucket (read the section description as it mentions this).
We can see from the second histogram section that there were 52 times that log file parallel write was over 64ms as well as 3535 times log file sync was over 64ms. At this point a hypothesis is formed that the two events are correlated — that is, the belief is the long log file parallel write events may be causing the long log file sync. To find data that supports the hypothesis (or not) we can look at the Active Session History (ASH) data to get a more granular view of the wait events.
Diving Into The ASHes With R
If the hypothesis is correct, the ASH data should show times where we observe long log file parallel write and long log file sync waits. One could write SQL against V$ACTIVE_SESSION_HISTORY to collect all the samples that may demonstrate the hypothesis — e.g. collect a list of sample ids for each event and examine them for overlap, but a visualization is worth a thousand words.
One way to get quick and easy visualizations from ASH data is to use R. R has become a very popular tool for those doing statistical analysis and it has some quite useful graphing and plotting packages built in. R can connect to Oracle via a JDBC package which makes importing data trivial.
Here is a plot that I put together using R for ~300 sample ids (~5 minutes) from ASH (recall that TIME_WAITED is in microseconds):
As you can see from the plots, nearly every time there is a long log file parallel write there are also numerous long log file sync events. This data supports the hypothesis.
Most any statistician (or anyone keen on numbers and stats) will tell you averages suck. This does not mean averages are not useful, but one has to keep in mind averages can hide infrequent outliers (which may make them a tad bit evil). For instance, AWR is reporting an average of 1ms for log file parallel write but we can see that there are several that are falling in the 20ms range in the 5 minute capture. If we zoom in on the graph (set the y axis at a max of 3ms), we can understand why:
Most of the log file parallel write events are coming in around 0.5ms so even with some 20ms outliers it still yields an average of 1ms, thus hiding the magnitude of impact for the outliers. This is why drilling down into the ASH data was important for us to understand the scope of the issue.
With More Data Comes More Questions
At this point, the visualization of ASH data shows a strong correlation between log file sync and log file parallel write outliers, but the root cause has not yet been identified. Perhaps there is more insight lurking in the ASH data?
Given this is an OLTP workload and log file parallel write is an IO operation, perhaps it’s worth looking at another IO operation like, say, db file sequential read – single block IO times.
In the below plot, I’ve taken a 60 second window of ASH data and plotted all the db file sequential read events.
Interestingly, there are several samples where the IOs are taking significantly longer than most — the majority are below 10ms but we see a few samples where there are groups in the 20ms or more range. Let’s add the log file sync events to the plot.
Hmmm… even more interesting. The data is showing a correlation between log file sync and db file sequential read. Any bets on what the plot looks like if log file parallel write is added?
Very interesting. The data is showing us that all 3 events are correlated strongly. Clearly we are on to something here…and using R to plot the ASH data was an easy way to present and understand it visually.
While this blog post did not go into specific root cause, it was a healthy journey in debugging and data driven analysis. Keep in mind that just because AWR averages look good, don’t overlook the fact that the event histograms should also be reviewed to see if there are outliers. R allows an easy way to put scatter plots of event times together to get a nice visual of what is going on.
Here is the code I used to do the analysis. It’s pretty straight forward and well commented (I think). Enjoy!
I’ll also mention that I use the RStudio IDE for R.
The UK Oracle User Group Conference 2011 is just around the corner and I just realized that I haven’t yet highlighted this, other than the “I’m speaking” banner on my blog. I’ll be speaking on one of my favorite and most used reports — the SQL Monitor Report. Below are the session details [direct link]. Hope to see you there!
|Presentation Title:||A Deep Dive into the SQL Monitoring Report|
|Speaker:||Mr Greg Rahn|
|Presentation abstract:||The SQL Monitoring Report was introduced in Oracle Database 11g and has become the single most used diagnostic report by the Oracle Real-World Performance Group and Oracle database development for SQL statement performance issues. This session will start with a technical overview of the SQL Monitoring Report and what metrics and information it provides. From there we’ll deep dive into numerous examples from the field explaining how this report was used to diagnose and validate performance issues. There is a wealth of information in the SQL Monitoring Report and this session will provide the necessary knowledge to best leverage it.|
|Presentation begins:||05/12/2011 11:05|
|Presentation content level:||2 (1 = indepth, 5 = strategic overview)|
|Audience experience:||All experience levels|
|Audience function:||Product Expert, Product User|
|Speaker biography:||Greg Rahn is a database performance engineer in the Real-World Performance Group at Oracle Corporation. He joined the Real-World Performance Group in 2004 and has been working with Oracle databases since 1997. His primary focus is performance and scalability for Oracle data warehouses, specializing in Parallel Execution and Query Optimizer subject areas.|
|Related topics:||Oracle RDBMS: Features and Options, Oracle RDBMS: Performance Managment|
A few weeks back one of the Vertica developers put up a blog post on counting triangles in an undirected graph with reciprocal edges. The author was comparing the size of the data and the elapsed times to run this calculation on Hadoop and Vertica and put up the work on github and encouraged others: “do try this at home.” So I did.
Vertica draws attention to the fact that their compression brought the size of the 86,220,856 tuples down to 560MB in size, from a flat file size of 1,263,234,543 bytes resulting in around a 2.25X compression ratio. My first task was to load the data and see how Oracle’s Hybrid Columnar Compression would compare. Below is a graph of the sizes.
As you can see, Oracle’s default HCC query compression (query high) compresses the data over 2X more than Vertica and even HCC query low compression beats out Vertica’s compression number.
Query Elapsed Times
The closest gear I had to Vertica’s hardware was an Exadata X2-2 system — both use 2 socket, 12 core Westmere-EP nodes. While one may try to argue that Exadata may somehow influence the execution times, I’ll point out that I was using In-Memory Parallel Execution so no table data was even read from spinning disk or Exadata Flash Cache — it’s all memory resident in the database nodes’ buffer cache. This seems to be inline with how Vertica executed their tests though not explicitly stated (it’s a reasonable assertion).
After I loaded the data and gathered table stats, I fired off the exact same SQL query that Vertica used to count triangles to see how Oracle would compare. I ran the query on 1, 2 and 4 nodes just like Vertica. Below is a graph of the results.
As you can see, the elapsed times are reasonably close but overall in the favor of Oracle winning 2 of the 3 scale points as well as having a lower sum of the three executions: Vertica 519 seconds, Oracle 487 seconds — advantage Oracle of 32 seconds.
It Should Go Faster!
As a database performance engineer I was thinking to myself, “it really should go faster!” I took a few minutes to look over things to see what could make this perform better. You might think I was looking at parameters or something like that, but you would be wrong. After a few minutes of looking at the query and the execution plan it became obvious to me — it could go faster! I made a rather subtle change to the SQL query and reran my experiments. With the modified SQL query Oracle was now executing twice as fast on 1 node than Vertica was on 4 nodes. Also, on 4 nodes, the elapsed time came in at just 14 seconds, compared to the 97 seconds Vertica reported — a difference of almost 7X! Below are the combined results.
What’s The Go Fast Trick?
I was thinking a bit more about the problem at hand — we need to count vertices but not count them twice since they are reciprocal. Given that for any edge, it exists in both directions, the query can be structured like Vertica wrote it — doing the filtering with a join predicate like e1.source < e2.source to eliminate the duplicates or we can simply use a single table filter predicate like source < dest before the join takes place. One of the first things they taught me in query planning and optimization class was to filter early! That notation pays off big here because the early filter cuts the rows going into the first join as well as the output of the first join by a factor of 2 — 1.8 billion rows output vs. 3.6 billion. That’s a huge savings not only in the first join, but also in the second join as well.
Here is what my revised query looks like:
with e1 as (select * from edges where source < dest), e2 as (select * from edges where source < dest), e3 as (select * from edges where source > dest) select count(*) from e1 join e2 on (e1.dest = e2.source) join e3 on (e2.dest = e3.source) where e3.dest = e1.source
First, I’d like to thank the Vertica team for throwing the challenge out there and being kind enough to provide the data, code and their elapsed times. I always enjoy a challenge — especially one that I can improve upon. Now, I’m not going to throw any product marketing nonsense out there as that is certainly not my style (and there certainly is more than enough of that already), but rather I’ll just let the numbers do the talking. I’d also like to point out that this experiment was done without any structure other than the table. And in full disclosure, all of my SQL commands are available as well.
The other comment that I would make is that the new and improved execution times really make a mockery of the exercise when comparing to Hadoop MapReduce or Pig, but I would also mention that this test case is extremely favorable for parallel pipelined databases that can perform all in-memory operations and given the data set is so small, this is the obviously the case. Overall, in my opinion, a poor problem choice to compare the three technologies as it obviously (over) highlights the right tool for the job cliche.
Experiments performed on Oracle Database 188.8.131.52.
Github source code: https://gist.github.com/grahn/1289188
I’ve seen some posts on the blogosphere where people attempt to explain (or should I say guess) how Exadata Smart Flash Logging works and most of them are wrong. Hopefully this post will help clear up some the misconceptions out there.
The following is an excerpt from the paper entitled “Exadata Smart Flash Cache Features and the Oracle Exadata Database Machine” that goes into technical detail on the Exadata Smart Flash Logging feature.
Smart Flash Logging works as follows. When receiving a redo log write request, Exadata will do
parallel writes to the on-disk redo logs as well as a small amount of space reserved in the flash
hardware. When either of these writes has successfully completed the database will be
immediately notified of completion. If the disk drives hosting the logs experience slow response
times, then the Exadata Smart Flash Cache will provide a faster log write response time.
Conversely, if the Exadata Smart Flash Cache is temporarily experiencing slow response times
(e.g., due to wear leveling algorithms), then the disk drive will provide a faster response time.
Given the speed advantage the Exadata flash hardware has over disk drives, log writes should be
written to Exadata Smart Flash Cache, almost all of the time, resulting in very fast redo write
performance. This algorithm will significantly smooth out redo write response times and provide
overall better database performance.
The Exadata Smart Flash Cache is not used as a permanent store for redo data – it is just a
temporary store for the purpose of providing fast redo write response time. The Exadata Smart
Flash Cache is a cache for storing redo data until this data is safely written to disk. The Exadata
Storage Server comes with a substantial amount of flash storage. A small amount is allocated for
database logging and the remainder will be used for caching user data. The best practices and
configuration of redo log sizing, duplexing and mirroring do not change when using Exadata
Smart Flash Logging. Smart Flash Logging handles all crash and recovery scenarios without
requiring any additional or special administrator intervention beyond what would normally be
needed for recovery of the database from redo logs. From an end user perspective, the system
behaves in a completely transparent manner and the user need not be aware that flash is being
used as a temporary store for redo. The only behavioral difference will be consistently low
latencies for redo log writes.
By default, 512 MB of the Exadata flash is allocated to Smart Flash Logging. Relative to the 384
GB of flash in each Exadata cell this is an insignificant investment for a huge performance
benefit. This default allocation will be sufficient for most situations. Statistics are maintained to
indicate the number and frequency of redo writes serviced by flash and those that could not be
serviced, due to, for example, insufficient flash space being allocated for Smart Flash Logging.
For a database with a high redo generation rate, or when many databases are consolidated on to
one Exadata Database Machine, the size of the flash allocated to Smart Flash Logging may need
to be enlarged. In addition, for consolidated deployments, the Exadata I/O Resource Manager
(IORM) has been enhanced to enable or disable Smart Flash Logging for the different databases
running on the Database Machine, reserving flash for the most performance critical databases.
I was just watching John Rauser’s keynote “What is a Career in Big Data?” from last weeks Strata Conference New York and I have to say it’s an amazing talk. I would highly recommended it to anyone who does any type of data analysis, including any type of performance analysis.
I found many of the “critical skill” points John made to have a strong correlation to performance analysis work as well. Some quotations that really stand out to me:
“[writing]…it’s the first major difference between mediocrity and greatness.” [10:39]
“If it isn’t written down, it never happened…if your writing is so opaque that people can not understand your work, then you may as well never have never done it.” [10:50]
“If you take a skeptical attitude toward your analysis you’ll look just as hard for data that refutes your hypothesis as you will for data that confirms it. A skeptic attacks the same question from many different angles and dramatically increases their confidence in the results.” [11:40]
The main reason that I wanted to highlight the critical skills of writing and skepticism is I see (read) way too many blog posts from Oracle users that fail on both of those skills. The writing simply fails to clearly communicate the issue at hand, what analysis was done and what data was used to draw the conclusion. Many blog posts also are written without any level of skepticism–they simply “report” their findings and fail to question their own work for accuracy. I hope this talk inspires you to raise your bar when it comes to performance work. Enjoy!