Visualizing Active Session History (ASH) Data With R

One of the easiest ways to understand something is to see a visualization. Looking at Active Session History (ASH) data is no exception and I’ll dive into how to do so with R and how I used R plots to visually present a problem and confirm a hypothesis. But first some background…

Background

Frequently DBAs use the Automatic Workload Repository (AWR) as an entry point for troubleshooting performance problems and in this case the adventure started the same way. In the AWR report Top 5 Timed Foreground Events, the log file sync event was showing up as the #3 event. This needed deeper investigation as often times the cause for longer log file sync times is related to longer log file parallel write times.

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log file sync                     3,155,253       9,197      3    6.4 Commit

Drilling into this a bit deeper the two log file events reported the following in the Foreground Wait Events and Background Wait Events sections of the AWR report:

Foreground Wait Events               
-> s  - second, ms - millisecond -    1000th of a second
-> Only events with Total Wait Time (s) >= .001 are shown
-> ordered by wait time desc, waits desc (idle events last)
-> %Timeouts: value of 0 indicates value was  ordered by wait time desc, waits desc (idle events last)
-> Only events with Total Wait Time (s) >= .001 are shown
-> %Timeouts: value of 0 indicates value was < .5%.  Value of null is truly 0

                                                             Avg
                                        %Time Total Wait    wait    Waits   % bg
Event                             Waits -outs   Time (s)    (ms)     /txn   time
-------------------------- ------------ ----- ---------- ------- -------- ------
log file parallel write         659,157     0        526       1      0.2   18.2

It is generally always worth looking at the breakdown of these times, as they are averages. We really want to understand the entire histogram of these wait events. For that we can look at the Wait Event Histogram section of the AWR report as below.

Wait Event Histogram
-> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000
-> % of Waits: value of .0 indicates value was  % of Waits: column heading of <=1s is truly 1s is truly >=1024ms
-> Ordered by Event (idle events last)

                                                    % of Waits
                                 -----------------------------------------------
                           Total
Event                      Waits   <1ms  <2ms  <4ms  <8ms <16ms <32ms  1s
-------------------------- ------ ----- ----- ----- ----- ----- ----- ----- -----
log file parallel write    661.6K  84.7   9.7   4.7    .4    .1    .3    .0
log file sync              3138.K  14.0  42.8  30.4   7.9   2.3   2.6    .1

Wait Event Histogram Detail (64 msec to 2 sec)
-> Units for Total Waits column: K is 1000, M is 1000000, G is 1000000000
-> Units for % of Total Waits:
   ms is milliseconds
   s is 1024 milliseconds (approximately 1 second)
-> % of Total Waits: total waits for all wait classes, including Idle
-> % of Total Waits: value of .0 indicates value was  Ordered by Event (only non-idle events are displayed)

                                                 % of Total Waits
                                 -----------------------------------------------
                           Waits
                           64ms
Event                      to 2s <32ms <64ms <1/8s <1/4s <1/2s   <1s   =2s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
log file parallel write       52 100.0    .0    .0
log file sync               3535  99.9    .1    .0

One thing that you should notice here is there are two sections of Wait Event Histogram; the buckets less than 32ms and buckets greater than 32ms. It is also important to note that only the absence of a value means no timings fell into that bucket — so even though the report shows .0 there are still events in that bucket (read the section description as it mentions this).

We can see from the second histogram section that there were 52 times that log file parallel write was over 64ms as well as 3535 times log file sync was over 64ms. At this point a hypothesis is formed that the two events are correlated — that is, the belief is the long log file parallel write events may be causing the long log file sync. To find data that supports the hypothesis (or not) we can look at the Active Session History (ASH) data to get a more granular view of the wait events.

Diving Into The ASHes With R

If the hypothesis is correct, the ASH data should show times where we observe long log file parallel write and long log file sync waits. One could write SQL against V$ACTIVE_SESSION_HISTORY to collect all the samples that may demonstrate the hypothesis — e.g. collect a list of sample ids for each event and examine them for overlap, but a visualization is worth a thousand words.

One way to get quick and easy visualizations from ASH data is to use R. R has become a very popular tool for those doing statistical analysis and it has some quite useful graphing and plotting packages built in. R can connect to Oracle via a JDBC package which makes importing data trivial.

Here is a plot that I put together using R for ~300 sample ids (~5 minutes) from ASH (recall that TIME_WAITED is in microseconds):

Rplot01

As you can see from the plots, nearly every time there is a long log file parallel write there are also numerous long log file sync events. This data supports the hypothesis.

Averages Suck

Most any statistician (or anyone keen on numbers and stats) will tell you averages suck. This does not mean averages are not useful, but one has to keep in mind averages can hide infrequent outliers (which may make them a tad bit evil). For instance, AWR is reporting an average of 1ms for log file parallel write but we can see that there are several that are falling in the 20ms range in the 5 minute capture. If we zoom in on the graph (set the y axis at a max of 3ms), we can understand why:
Rplot02

Most of the log file parallel write events are coming in around 0.5ms so even with some 20ms outliers it still yields an average of 1ms, thus hiding the magnitude of impact for the outliers. This is why drilling down into the ASH data was important for us to understand the scope of the issue.

With More Data Comes More Questions

At this point, the visualization of ASH data shows a strong correlation between log file sync and log file parallel write outliers, but the root cause has not yet been identified. Perhaps there is more insight lurking in the ASH data?

Given this is an OLTP workload and log file parallel write is an IO operation, perhaps it’s worth looking at another IO operation like, say, db file sequential read – single block IO times.

In the below plot, I’ve taken a 60 second window of ASH data and plotted all the db file sequential read events.

Rplot11
Interestingly, there are several samples where the IOs are taking significantly longer than most — the majority are below 10ms but we see a few samples where there are groups in the 20ms or more range. Let’s add the log file sync events to the plot.
Rplot12
Hmmm… even more interesting. The data is showing a correlation between log file sync and db file sequential read. Any bets on what the plot looks like if log file parallel write is added?

Rplot13

Very interesting. The data is showing us that all 3 events are correlated strongly. Clearly we are on to something here…and using R to plot the ASH data was an easy way to present and understand it visually.

Summary

While this blog post did not go into specific root cause, it was a healthy journey in debugging and data driven analysis. Keep in mind that just because AWR averages look good, don’t overlook the fact that the event histograms should also be reviewed to see if there are outliers. R allows an easy way to put scatter plots of event times together to get a nice visual of what is going on.

Source Code

Here is the code I used to do the analysis. It’s pretty straight forward and well commented (I think). Enjoy!

I’ll also mention that I use the RStudio IDE for R.

16 comments

  1. Marcin Przepiorowski

    Hi Greg,

    As far as I know time_waited from ASH is a value of last event of particular type in every ASH sample. If you will have long ‘log file parallel read’ (ex. 20 ms) followed by next shorter ‘log file parallel read’ (ex. 5 ms) in one same ASH sample – time_waited will be set to last captured event and it will be set to 5 ms. It can hide some long waits from your research but it can be detected by v$event_histogram to some extent.

    regards,
    Marcin

  2. Greg Rahn

    TIME_WAITED is unknown when an ASH sample captures an event. This is because you can’t know how long something waited until it is done waiting. After the wait is completed, ASH will “fix-up” the actual TIME_WAITED value for that event. The only time this is not true is if a wait spans more than one ASH sample (the wait is greater than one second, which should be uncommon), in which case only the most recent sample gets the fix-up. As a result, TIME_WAITED=0 before the fix-up, or the actual value after the fix-up.

  3. Marcin Przepiorowski

    Hi Greg,

    I agree that there is a fix up but it happen only to last event in sample – see entry on Kyle’s blog http://dboptimizer.com/2011/07/20/oracle-time-units-in-v-views/
    “TIME_WAITED – micro, only the last sample is fixed up, the others will have TIME_WAITED=0
    thanks to John Beresniewicz for this info.”

    It can be a lot of events finished in 1 sec but only one captured at sample time is added to ASH with time_waited – this is typical to every sampling method.

    It can be shown by testing ASH and v$event_histograms –
    last time_waited grater than 256 ms was at 21-DEC-11 10.08.18 according to ASH and
    has same entry in v$event_histogram but last entry with time_waited grater than 30 ms
    was at 21-DEC-11 10.43.24 according to ASH but it was captured via v$event_histogram at 21-DEC-11 10.45.41 and I was able to see that event timing between 1 and 32 appear in histogram view almost continuously while it is recorded not in every ASH sample.

    SQL > select sample_time, event, time_waited from v$active_session_history where event in ('log file parallel write') and time_waited > 30000 order by 1;
    
    SAMPLE_TIME               EVENT                       TIME_WAITED
    ------------------------- -------------------------- -----------
    ...
    21-DEC-11 10.03.03.053 AM log file parallel write          39473
    21-DEC-11 10.03.53.163 AM log file parallel write          91680
    21-DEC-11 10.07.52.599 AM log file parallel write          41722
    21-DEC-11 10.08.18.668 AM log file parallel write         333670
    21-DEC-11 10.11.41.038 AM log file parallel write          37655
    21-DEC-11 10.14.13.355 AM log file parallel write          42059
    21-DEC-11 10.16.45.620 AM log file parallel write          42398
    21-DEC-11 10.29.27.099 AM log file parallel write          33406
    21-DEC-11 10.35.47.833 AM log file parallel write          38666
    21-DEC-11 10.42.08.510 AM log file parallel write          40266
    21-DEC-11 10.43.24.671 AM log file parallel write          30188
    
    SQL > select WAIT_TIME_MILLI, WAIT_COUNT, LAST_UPDATE_TIME from v$event_histogram where event = 'log file parallel write';
    
    WAIT_TIME_MILLI WAIT_COUNT LAST_UPDATE_TIME
    --------------- ---------- -------------------------------------
                  1 1058090527 21-DEC-11 10.45.47.116600 AM +00:00
                  2   77869065 21-DEC-11 10.45.46.998460 AM +00:00
                  4   18893632 21-DEC-11 10.45.46.722916 AM +00:00
                  8    3858133 21-DEC-11 10.45.44.274108 AM +00:00
                 16    6466024 21-DEC-11 10.45.43.024767 AM +00:00
                 32    2323073 21-DEC-11 10.45.41.815964 AM +00:00
                 64    1557219 21-DEC-11 10.45.15.087870 AM +00:00
                128      34468 21-DEC-11 10.16.48.147529 AM +00:00
                256      18172 21-DEC-11 10.42.51.973893 AM +00:00
                512      12517 21-DEC-11 10.08.18.762433 AM +00:00
               1024        153 20-DEC-11 05.44.27.641221 AM +00:00
               2048          1 11-FEB-11 12.51.56.077582 AM +00:00
               4096          1 28-JUL-11 06.27.54.836614 AM +00:00
    

    regards,
    Marcin

  4. Marcin Przepiorowski

    Hi Greg,

    Thanks for your comment but I’m wondering how it work for subsecond event times.
    Let say that I’m doing a index read and get following output from trace file between two ASH snapshots

    - ASH snapshot 1
    WAIT #46938270163504: nam='db file sequential read' ela= 5959 file#=1 block#=57876 blocks=1 obj#=112 tim=1324578633072101
    WAIT #46938270163504: nam='db file sequential read' ela= 2963 file#=1 block#=34039 blocks=1 obj#=71 tim=1324578633075130
    WAIT #46938270163504: nam='db file sequential read' ela= 3968 file#=1 block#=56533 blocks=1 obj#=71 tim=1324578633079489
    - ASH snapshot 2
    

    What will be a TIME_WAITED value for a ASH snapshot 2 if it take place when a last event is WAITING mode ? It will be 0 at the beginning and then it will be fix-up to 3968, isn’t it ? But my point is that there was a longer wait time 5959 for first wait and it will be not recorded in ASH.

    regards,
    Marcin

  5. Greg Rahn

    This is correct and is worth pointing out: Since ASH samples, only events that are “in-flight” are captured and then fixed-up. There could be (and are) numerous events that start and finish between ASH samples that are not captured in ASH, however, their times are collected and increase the counters for the v$event_histogram buckets.

    To use a slightly clearer example we can show it like this for a session:

    db file sequential read - BEGIN
    db file sequential read - END (3ms)
    db file sequential read - BEGIN
    db file sequential read - END (20ms)
    db file sequential read - BEGIN
    ASH SNAPSHOT, TIME_WAITED=0
    db file sequential read - END (4ms)
    TIME_WAITED fix-up takes place, TIME_WAITED=4ms
    

    In this case the 3ms and 20ms events are not captured in ASH, but will be collected as part of v$event_histogram. The 4ms event will be captured since it was in-flight/active when ASH took its sample.

    The case I was discussing earlier when an event spans ASH snapshots would be like this:

    db file sequential read - BEGIN
    ASH SNAPSHOT [1], TIME_WAITED=0
    ASH SNAPSHOT [2], TIME_WAITED=0
    db file sequential read - END (1.2 seconds)
    TIME_WAITED fix-up takes place, TIME_WAITED[2]=1.2 seconds, TIME_WAITED[1]=0
    
  6. Greg Rahn

    I think you are misunderstanding Kyle and John’s comment, and I think I know why, but in short we are saying exactly the same thing. What is missing from Kyle’s page is the specifics — that being the granularity is at a session’s wait, not all sessions on the same wait. Or to put it another way, TIME_WAITED=0 for each session’s entry in ASH for that sample until the fix-up takes place for that session, however, if a given session’s wait spans more than one ASH sample, only the most recent sample of that session, gets the fix-up.

    My comment about a wait greater than a second (but say less than two seconds) is guaranteed to be in two ASH samples is incorrect — it’s guaranteed to be in at least one ASH sample. To guarantee it be in two ASH samples, it has to be greater than two seconds, but it could stil show up in two samples and be less than two seconds (but greater than one second) given it was in a sample moments after it started.

  7. GregG

    Hi,
    great blog.
    You have mentioned ‘Here is the code I used to do the analysis…’ and I dont see source file :). Sorry if missing something obvious.
    Regards
    GregG

  8. John Beresniewicz

    Nice pseudo-code for the sampling logic Greg, and your explanation is exactly right.

    When thinking about ASH data and what it means it is really important to remember that ASH is sampling SESSION ACTIVITY, not wait events. Sessions on CPU and in non-idle wait are considered active so sessions in these states are the ones sampled.

    The sampling is unbiased in the (wall clock) time dimension in that the sampler wakes up like clockwork and does its thing. However the sampling IS biased relative to the event TIME_WAITED dimension since longer non-idle waits have higher probability of being sampled. ASH does not try to sample the longest or shortest events since last sample, simply the current event at sample time, whatever it is.

    So you cannot in general do something like:

    SELECT event, MAX(time_waited) from V$ASH group by event;

    and think you are actually getting the maximum time_waited by event, since the event with max time_waited may not have been sampled.

  9. Dom Giles

    Hi Greg…

    Very nice. Since most of the stats we tend to look at our DB stats as opposed to business information it’s nice to see a use of R relevant to some of the work we do…

    Alternatively I’ve put together a slightly more dynamic database time monitor here…

    http://www.dominicgiles.com/dbtimeviewer.html

    Having said that… I’m drawn to your R approach…. not sure why.

    Great examples….

    Dom.

  10. Greg Rahn

    In my opinion, looking at performance data is much the same as looking at most any other data set these days. The “data science” term can be applied as well — I need to find my data source (say host or database and what metrics from either or both) and then massage them into whatever format I need for my analysis and then poke around. Doing things like outlier detection, correlation and data visualizations are just a few things I’ve done with R. I have a few more things in the pipeline so stay tuned!

  11. goran

    Hi Greg,

    thanks for great post and specially for explanation related to ASH samples.
    I had ‘confusion’ on results presented in v$event_histogram and v$active_session_history where I wasn’t eble to draw correlation.
    Finally you and John Beresniewicz cleard it up … many thanks for this.

    @Dominic:
    Great tools !
    I am currently using ASH Viewer tool (http://ashv.sourceforge.net) to quickly check ASH data

  12. Yuri van Buren

    Hi Greg,

    T think you have to read the following support note:
    Troubleshooting: log file sync’ Waits [ID 1376916.1]

    Basically it says that, the log file parallel write is part of the log file sync operation, so Yes they are correlated.

    If you’re log file parallel write is close to the log file sync times, the time spend is on IO writes. If log file parallel write is very small compared to log file sync you loose time in communication e.g. in doing many small commits;

    Regards,

    Yuri.

  13. Kyle Hailey

    Here is a free ASH viewer called ASHMON I wrote some years ago in TCL/TK (runs on windows or LINUX as a fat client).
    http://ashmasters.com/ash-tools/
    It will by default read v$active_session_history, but if you install S-ASH and log into ASHMON as the S-ASH owner, it will read S-ASH tables
    see
    http://ashmasters.com/ash-simulation/
    http://oracleprof.blogspot.com/2011/11/new-release-of-s-ash-v23.html
    SASH obviates the need for the diagnostics pack license and allows it to run on standard edition and version 9 (as well as probably 7 and 8 though I haven’t tried)

    For a pay for tool that is an ASH simulator (ie no diagnostics pack license, runs on v9 etc and runs on DB2, Sybase and SQL Server) there is “DB Optimizer” from Embarcadero. Cool thing is it saves the data to flat file that can be mailed and shared between people analyzing the data. (not to mention DB Optimizer comes with Visual SQL Tuning (VST).

    For another pay for tool lab128 is super fast and cheaper than DB Optimizer but only works on Oracle and has no sql tuning but it’s fast. Did I say it’s fast?

    Currently I’m interested in web interfaces to data, so set up a jquery and highcharts version here:
    http://dboptimizer.com/2011/10/31/w-ash-web-enabled-ash/

    – Kyle Hailey

  14. Greg Rahn

    Yuri-

    I’m aware of this, as it was stated in the first paragraph: “often times the cause for longer log file sync times is related to longer log file parallel write times.” Do note though, that the data collection and analysis was necessary to prove this was indeed that case because the entry point to troubleshooting this was the observation of long log file syncs.

    Thanks for the pointer to the note.

  15. Bertrand

    Hello Greg,

    FYI i created 4 R scripts that you may found useful:

    – To retrieve and visualize wait events from AWR and in real time:

    http://bdrouvot.wordpress.com/2013/06/04/retrieve-and-visualize-in-real-time-wait-events-metrics-with-r/

    and

    http://bdrouvot.wordpress.com/2013/03/26/retrieve-and-visualize-wait-events-metrics-from-awr-with-r/

    – To retrieve and visualize system stats from AWR and in real time:

    http://bdrouvot.wordpress.com/2013/06/05/retrieve-and-visualize-system-statistics-in-real-time-with-r/

    and

    http://bdrouvot.wordpress.com/2013/03/27/retrieve-and-visualize-system-statistics-metrics-from-awr-with-r/

    Rgds
    Bertrand

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s