What AWR isn’t telling you

It is well known that AWR, and Statspack before it, takes snapshots of the V$ views (or rather the underlying objects) to produce its data.

It is also well known that, when it comes to SQL statements and their statistics in the shared pool, if a big-hitting statement runs but is no longer in the shared pool by the time of the snapshot, then it can’t be recorded in your AWR picture of activity.

But like many things supposedly well known, this behaviour can still come back and remind you that you’ve forgotten or overlooked it.

Here is a little illustration which happened earlier this week.

This week I’ve been looking closely at activity which happens on a particular DB between 16:00 and 17:00 because the IO subsystem is showing signs of stress and the timings for ‘Average Synchronous Single-Block Read Latency’ (V$SYSMETRIC_SUMMARY / DBA_HIST_SYSMETRIC_SUMMARY) have pushed above our amber warning levels.

Don’t get me started, but our amber level for a single block read is 20ms!
Way too high for my liking as a sustained average, but that is apparently the SLA from the SAN.
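
For context, this is roughly how I check that metric retrospectively, a minimal sketch assuming the standard DBA_HIST_SYSMETRIC_SUMMARY columns and our 20ms amber level:

-- Average single-block read latency per snapshot interval against a 20ms amber level
-- (the metric is reported in milliseconds; adjust the date range and threshold to taste).
SELECT s.instance_number
,      s.begin_time
,      s.end_time
,      ROUND(s.average, 2) avg_ms
,      ROUND(s.maxval , 2) max_ms
,      CASE WHEN s.average > 20 THEN 'AMBER' END alert
FROM   dba_hist_sysmetric_summary s
WHERE  s.metric_name  = 'Average Synchronous Single-Block Read Latency'
AND    s.begin_time  >= SYSDATE - 7
ORDER BY s.instance_number, s.begin_time;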

Why do we have such an alert?
The purpose is two-fold.

  1. It can sometimes be a warning that some of our IO-sensitive batch work might show some performance degradation, threatening SLAs
  2. In the past, it has been an effective warning that a sql statement executed concurrently in multiple threads from the applications has switched to a "bad" execution plan whose impact is system-wide (normally heavy concurrent direct path reads).

So… I was hovering around this system during the relevant timeframe looking at what was going on.

And I observed some heavy hitting queries running in parallel originating from another system (i.e. not my application).

There were two executions of one sql id (0cu2sm062jutc) running concurrently with a single execution of another sql statement.

The top-level RTSM execution metrics for these three executions were all very similar to this:

Each was executing for about 10 minutes, doing a lot of physical direct path reads in parallel (38GB) and accumulating over 1 hour of parallel processing time.

 Status              :  DONE (ALL ROWS)          
 Instance ID         :  2                        
 SQL ID              :  0cu2sm062jutc            
 SQL Execution ID    :  33554432                 
 Execution Started   :  10/13/2015 16:03:35      
 First Refresh Time  :  10/13/2015 16:03:37      
 Last Refresh Time   :  10/13/2015 16:13:19      
 Duration            :  584s                     
 Module/Action       :  JDBC Thin Client/-       
 Program             :  JDBC Thin Client         
 Fetch Calls         :  257                      

Global Stats
===================================================================================================================================
| Elapsed |   Cpu   |    IO    | Application | Concurrency | Cluster  |  Other   | Fetch | Buffer | Read  | Read  | Write | Write |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Waits(s)   | Waits(s) | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Reqs  | Bytes |
===================================================================================================================================
|    4041 |     252 |     3782 |        0.06 |        1.75 |     0.97 |     3.95 |   257 |     6M | 75051 |  38GB |  1070 | 125MB |
===================================================================================================================================
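
(That output is just Real-Time SQL Monitoring; a minimal sketch of the sort of call used to pull it, with the sql id and execution id from above:)

-- Pull the text RTSM report for a specific execution of the statement above.
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id       => '0cu2sm062jutc',
         sql_exec_id  => 33554432,
         type         => 'TEXT',
         report_level => 'ALL') rtsm_report
FROM   dual;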

And then, whilst I was fiddling with various queries against V$SQL, the queries in question disappeared from V$SQL.

No big surprise of course; stuff is liable to be released from the library cache whenever Oracle sees fit.

But then the penny dropped and I was left thinking it was no wonder that I couldn’t see this in my regular retrospective looks at this timeframe via AWR.

But the cumulative execution time above is larger than that of the number 3 SQL statement in AWR ordered by elapsed time … so I’m not making this up. It’s significant.

This got me thinking – how can we try to identify heavy-hitting SQL which is being missed by AWR – ASH surely? – and what can we do to help – COLORED SQL?

Here is my first attempt at such an ASH query, trying to find queries which have taken longer than 10 minutes in a single execution in a single snapshot period but which aren’t in AWR.

SELECT *
FROM   (SELECT h.dbid
        ,      h.instance_number
        ,      h.snap_id
        ,      h.sql_id
        ,      h.sql_exec_id
        ,      h.sql_exec_start
        ,      h.qc_instance_id||'::'||h.qc_session_id||'::'||h.qc_session_serial# px_details
        ,      COUNT(*)                     sample_count
        ,      COUNT(DISTINCT h.session_id) session_count
        --,      MIN(h.snap_id) min_snap
        --,      MAX(h.snap_id) max_snap
        ,      MIN(h.sample_time)           first_sample
        ,      MAX(h.sample_time)           last_sample
        --     wall-clock seconds between first and last ASH sample for the execution
        ,    ((EXTRACT(DAY    FROM (MAX(h.sample_time) - MIN(h.sample_time)))*86400)
            + (EXTRACT(HOUR   FROM (MAX(h.sample_time) - MIN(h.sample_time)))*3600)
            + (EXTRACT(MINUTE FROM (MAX(h.sample_time) - MIN(h.sample_time)))*60)
            + (EXTRACT(SECOND FROM (MAX(h.sample_time) - MIN(h.sample_time))))) duration
        FROM   dba_hist_active_sess_history h
        ,      dba_hist_snapshot            s
        WHERE  h.dbid            = s.dbid
        AND    h.snap_id         = s.snap_id
        AND    h.instance_number = s.instance_number
        AND    h.sql_id          IS NOT NULL
        AND    h.sql_exec_id     IS NOT NULL
        --     only executions with no corresponding statistics captured in AWR
        AND    NOT EXISTS (SELECT 1
                           FROM   dba_hist_sqlstat st
                           WHERE  st.dbid            = h.dbid
                           AND    st.snap_id         = h.snap_id --BETWEEN xxx.min_snap AND xxx.max_snap
                           AND    st.instance_number = h.instance_number
                           AND    st.sql_id          = h.sql_id)
        GROUP BY 
               h.dbid
        --
        ,      h.snap_id
        --
        ,      h.instance_number
        ,      h.sql_id
        ,      h.sql_exec_id
        ,      h.sql_exec_start
        ,      h.qc_instance_id
        ,      h.qc_session_id
        ,      h.qc_session_serial#) xxx
WHERE  duration > 600
ORDER BY snap_id DESC, duration DESC;

The main problem with this query is that DURATION does not tell us how long the query has been active in the database.

We could have a query which has been active all that time but was doing relatively little in the DB.
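
A rough workaround, given that DBA_HIST_ACTIVE_SESS_HISTORY keeps roughly one-in-ten ASH samples, is to use the sample count as an approximation of active DB time instead. A minimal sketch of that idea in isolation (the one-row-is-about-10-seconds assumption is only approximate):

-- Approximate active DB time per execution: each persisted ASH row represents
-- roughly 10 seconds of active session time.
SELECT h.sql_id
,      h.sql_exec_id
,      h.sql_exec_start
,      COUNT(*) * 10 approx_db_time_secs
FROM   dba_hist_active_sess_history h
WHERE  h.sql_id      IS NOT NULL
AND    h.sql_exec_id IS NOT NULL
GROUP BY h.sql_id, h.sql_exec_id, h.sql_exec_start
HAVING COUNT(*) * 10 > 600
ORDER BY approx_db_time_secs DESC;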

I had a quick look (again) at TM_DELTA_TIME and TM_DELTA_DB_TIME but they are still unreliable – in some rows they look ok, in others not.

Let’s forget that and pretend that the SQL above works ok.

Perhaps we can add a filter on the sample count, i.e. where COUNT(*) is above some threshold, because otherwise this is currently giving me hundreds of statements in my production database.
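
Something like this on the outer query, replacing the final filter above (sample_count being the COUNT(*) alias from the query; 30 samples is an arbitrary threshold):

-- Filter on the number of ASH samples as well as the wall-clock duration.
WHERE  duration     > 600
AND    sample_count > 30
ORDER BY snap_id DESC, duration DESC;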

With that token threshold in place, it still gives me 166 statements over the past 30 days.

And that’s a problem because my intention was to use this to identify SQL statements to add to AWR via ADD_COLORED_SQL.
And that has a limit of 100 statements.
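
(For anyone unfamiliar, colouring a statement is just a call per sql id; a minimal sketch using the sql id from earlier:)

BEGIN
  -- Mark the statement as "colored" so AWR captures it at every snapshot,
  -- if (and only if) it is in the shared pool at snapshot time.
  DBMS_WORKLOAD_REPOSITORY.ADD_COLORED_SQL(sql_id => '0cu2sm062jutc');
END;
/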

Ah… but no, the other penny has just dropped whilst writing this…

If I colour this SQL, it will make no difference. That just means AWR will record that SQL if it’s there.

But if it’s not in the shared pool at snapshot time, then it’s simply not in AWR, coloured or not.

Duh!

It’s almost as if we need a mechanism whereby, once a sql statement has breached a certain threshold, it can’t be flushed from the shared pool until MMON has written it to AWR.

No, that’s not right. That would present other problems.

I could shorten the AWR snapshot interval but I’m not convinced.
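
(That would just be a one-off call, sketched here with a 15-minute interval purely as an illustration, not a recommendation:)

BEGIN
  -- Shorten the AWR snapshot interval to 15 minutes; retention and top-N settings unchanged.
  DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval => 15);
END;
/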

I think this is an oft-overlooked yet significant shortcoming.

So, for the moment at least, I’m left with what I think is a problem and I can’t think of a good solution… can you?

Or are you thinking mountain vs molehill?
