What AWR isn’t telling you
October 16, 2015 Leave a comment
It is well-known that AWR, and Statspack before, take snapshots of V$ views (or rather the underlying objects) to produce the data in AWR.
It is also well-known that, when considering sql and its statistics in the shared pool, if something big hitting happens but the big-hitter is no longer in the shared pool by the time of the snapshot, then it can’t be recorded in your AWR picture of activity.
But like many things supposedly well-known, it can still come back and remind you that you’ve forgotten or overlooked this behaviour.
Here is a little illustration which happened earlier this week.
This week I’ve been looking closely at activity which happens on a particular DB between 16:00 and 17:00 because the IO subsystem is showing signs of stress and the timings for ‘Average Synchronous Single-Block Read Latency’ (V$SYSMETRIC_SUMMARY / DBA_HIST_SYSMETRIC_SUMMARY) have pushed above our amber warning levels.
Don’t get me started but our amber level for a single block read is 20ms!
Way too high for my liking for such sustained average but that is apparently the SLA from the SAN.
Why do we have such an alert?
The purpose is two-fold.
- It can sometimes be a warning that some of our IO-sensitive batch work might show some performance degradations, threatening SLAs
- In the past, it has been an effective warning that a sql statement executed concurrently in multiple threads from the applications has switched to a "bad" execution plan whose impact is system-wide (normally heavy concurrent direct path reads).
So… I was hovering around this system during the relevant timeframe looking at what was going on.
And I observed some heavy hitting queries running in parallel originating from another system (i.e. not my application).
There were two executions of one sql id (0cu2sm062jutc) executing concurrently with one single execution of a another sql statement.
The top-level RTSM execution metrics for these three executions were all very similar to this:
Executing for about 10 minutes, in parallel doing a lot of physical direct path reads (38GB) and accumulating over 1 hour in parallel processing time.
Status : DONE (ALL ROWS) Instance ID : 2 SQL ID : 0cu2sm062jutc SQL Execution ID : 33554432 Execution Started : 10/13/2015 16:03:35 First Refresh Time : 10/13/2015 16:03:37 Last Refresh Time : 10/13/2015 16:13:19 Duration : 584s Module/Action : JDBC Thin Client/- Program : JDBC Thin Client Fetch Calls : 257 Global Stats =================================================================================================================================== | Elapsed | Cpu | IO | Application | Concurrency | Cluster | Other | Fetch | Buffer | Read | Read | Write | Write | | Time(s) | Time(s) | Waits(s) | Waits(s) | Waits(s) | Waits(s) | Waits(s) | Calls | Gets | Reqs | Bytes | Reqs | Bytes | =================================================================================================================================== | 4041 | 252 | 3782 | 0.06 | 1.75 | 0.97 | 3.95 | 257 | 6M | 75051 | 38GB | 1070 | 125MB | ===================================================================================================================================
And then whilst I was fiddling with various queries against V$SQL the queries in question disappeared from V$SQL.
No big surprise of course, stuff is liable to be released from the library cache whenever it sees fit.
But then the remembrance penny hit home and I was left thinking that it was no wonder that I couldn’t see this regularly when looking retrospectively at this timeframe via AWR.
But this cumulative exection time above is larger than the number 3 SQL statement in AWR by elapsed time … so I’m not making this up. It’s significant.
This got me thinking – how can we try to identify heavy-hitting SQL which is being missed by AWR – ASH surely? – and what can we do to help – COLORED SQL?
Here is my first attempt at such an ASH query, trying to find queries which have taken longer than 5 minutes in a single execution in a single snapshot period but which aren’t in AWR.
SELECT * FROM (SELECT h.dbid , h.instance_number , h.snap_id , h.sql_id , h.sql_exec_id , h.sql_exec_start , h.qc_instance_id||'::'||h.qc_session_id||'::'||h.qc_session_serial# px_details , COUNT(*) , COUNT(DISTINCT h.session_id) --, MIN(h.snap_id) min_snap --, MAX(h.snap_id) max_snap , MIN(h.sample_time) , MAX(h.sample_time) , ((EXTRACT(DAY FROM (MAX(h.sample_time) - MIN(h.sample_time)))*86400) + (EXTRACT(HOUR FROM (MAX(h.sample_time) - MIN(h.sample_time)))*3600) + (EXTRACT(MINUTE FROM (MAX(h.sample_time) - MIN(h.sample_time)))*60) + (EXTRACT(SECOND FROM (MAX(h.sample_time) - MIN(h.sample_time))))) duration FROM dba_hist_active_sess_history h , dba_hist_snapshot s WHERE h.dbid = s.dbid AND h.snap_id = s.snap_id AND h.instance_number = s.instance_number AND sql_id IS NOT NULL AND sql_exec_id IS NOT NULL AND NOT EXISTS (SELECT 1 FROM dba_hist_sqlstat st WHERE st.dbid = h.dbid AND st.snap_id = h.snap_id --BETWEEN xxx.min_snap AND xxx.max_snap AND st.instance_number = h.instance_number AND st.sql_id = h.sql_id) GROUP BY h.dbid -- , h.snap_id -- , h.instance_number , h.sql_id , h.sql_exec_id , h.sql_exec_start , h.qc_instance_id , h.qc_session_id , h.qc_session_serial#) xxx WHERE duration > 600 ORDER BY snap_id DESC, duration DESC;
The main problem with this query is that DURATION does not tell us how long the query has been active in the database.
We could have a query which has been active all that time but was doing relatively little in the DB.
I had a quick look (again) at TM_DELTA_TIME and TM_DELTA_DB_TIME but they are still unreliable – in some of the rows it looks ok, in others not.
Let’s forget that and pretend that the SQL above works ok.
Perhaps we can add a limit WHERE the COUNT(*) > threshold because otherwise this is currently giving me 100s of statements in my production database.
That gives me a token 166 statements over the past 30 days.
And that’s a problem because my intention was to use this to identify SQL statements to add to AWR via ADD_COLORED_SQL.
And that has a limit of 100 statements.
Ah… but no, the other penny has just dropped whilst writing this…
If I add color this SQL, that will make no difference. That just means AWR will record that SQL if it’s there.
But if it’s not in the shared pool at snapshot time, it’s not in the shared pool so it’s not in AWR.
It’s almost as if we need a process which if a sql statement has breached a certain threshold then it won’t let it be flushed until MMON has written it to AWR.
No, that’s not right. That would present other problems.
I could shorten the AWR snapshot interval but I’m not convinced.
I think this is an oft-overlooked yet significant shortcoming.
So, for the moment at least, I’m left with what I think is a problem and I can’t think of a good solution… can you?
Or are you thinking mountain vs molehill?