Flippin’ Witch Hunt – adaptive direct path read
June 19, 2012 5 Comments
Plan flips… here seems to be general impression that most application performance issues are going to be caused by some sort of “plan flip” and that “plan flips” are bad.
- If you don’t use plan stability features, should you really expect plan stability?
- Even outside your stats job window whenever that is?
- What if you’re using bind variables with histograms?
- And you’ve got features like dynamic sampling which will probably get a slightly different sample for each hard parse?
- And if you’ve also got cardinality feedback kicking in?
The old battle between stability and optimisation, I suppose.
I made a throwaway comment on a forum thread about having seen a performance problem caused by a sql statement which changed to doing direct path IO.
I thought it might be vaguely interesting to do a quick blog post and back that up with a tiny bit of “evidence”.
This was getting on for three weeks ago now (29th May) and there were bigger fish to fry at the time so some of the evidence and some of the accuracy of the evidence might wont a little but I love having a long AWR retention and the outputs below are what ASH & AWR can tell us now about what happened back then.
Problem:
A particular sql statment was running slowly.
Initial analysis had identified a change in the plan and that change was caused by cardinality feedback and changes in plans are bad of course 😉
Looking at DBA_HIST_SQLSTAT, I could see that the SQL regularly flip-flopped between two plans, one of which was the plan normally received after a hard parse, the other a change brought on by cardinality feedback.
But there really wasn’t much to choose between the plans in terms of statistics and averages.
In the execution plan, we can also skip specifics, the only relevant detail that there was a FULL TABLE SCAN which was the inner part (i.e. executed multiple times for each row in the driving / outer rowsource) of a NESTED LOOP – something which is rarely going to be a good thing, i.e.
NESTED LOOP some row source FULL TABLE SCAN
However, if we cross-reference DBA_HIST_ACTIVE_SESS_HISTORY, then we see the critical differences between the executions of this FULL TABLE SCAN on the 28th / 29th May and other days.
If you don’t like averages then I don’t blame you but you won’t like this query:
with ash as (select trunc(ash.sample_time) snap_day , min(ash.snap_id) min_snap_id , max(ash.snap_id) max_snap_id , ash.current_obj# obj_id , ash.sql_id , ash.sql_plan_hash_value , ash.sql_plan_operation , ash.sql_plan_options , ash.current_obj# , ash.event , count(distinct ash.sql_exec_id) ash_cnt from dba_hist_active_sess_history ash where ash.sql_id = 'drbkfkfxg6a6a' and ash.sql_plan_operation = 'TABLE ACCESS' and ash.sql_plan_options = 'FULL' and ash.current_obj# = 169748 and ash.sample_time >= to_date('01/05/2012','DD/MM/YYYY') and ash.event is not null group by trunc(ash.sample_time) , ash.sql_plan_hash_value , ash.sql_id , ash.sql_plan_operation , ash.sql_plan_options , ash.current_obj# , ash.event) select ash.snap_day , ash.event , ash.ash_cnt , (select round(sum(elapsed_time_delta) / sum(executions_delta) /1000/1000) from dba_hist_sqlstat st where st.sql_id = ash.sql_id and st.snap_id between ash.min_snap_id and ash.max_snap_id) avg_elp , (select round(sum(rows_processed_delta) / sum(executions_delta)) from dba_hist_sqlstat st where st.sql_id = ash.sql_id and st.snap_id between ash.min_snap_id and ash.max_snap_id) avg_rws , round((1275068416 - (select sum(space_allocated_delta) from dba_hist_seg_stat st where st.obj# = ash.obj_id and st.snap_id >= ash.min_snap_id)) / 1024/1024) "SPACE_USED (ISH)" , (select min(bytes)/1024/1024 from dba_hist_sgastat st where name = 'buffer_cache' and st.snap_id between ash.min_snap_id and ash.max_snap_id) min_bc from ash order by snap_day;
What am I trying to see:
- Historic running of a particular sql statment
- Focusing on this FULL TABLE SCAN operation which was a problem on the 28th / 29th
- Occurences in ASH (be very careful of reading too much into samples of samples and aggregates and averages)
- What the statement tended to wait on and a feel for how much waiting (beware of sampled samples)
- And finally a rough approximation of how big the object was relative to the buffer cache (we know what size it is now (which I’ve plugged into the query) so work back through the growth in segment stats. It’s using space allocated because I don’t have access to dbms_space to get space used and space allocated is going to be a high estimate of actual space used – it certainly shouldn’t be more than that)
SNAP_DAY EVENT ASH_CNT AVG_ELP SPACE_USED (ISH) MIN_BC --------- -------------------------------- ---------- ---------- ---------------- ------- 01-MAY-12 db file scattered read 1 19 1152 61696 01-MAY-12 db file sequential read 1 19 1152 61696 01-MAY-12 db file sequential read 1 25 1152 61696 02-MAY-12 db file sequential read 1 8 1152 61696 02-MAY-12 db file scattered read 1 8 1152 61696 07-MAY-12 db file scattered read 1 7 1152 61696 08-MAY-12 db file scattered read 1 24 1152 61696 08-MAY-12 db file sequential read 1 24 1152 61696 09-MAY-12 db file sequential read 1 100 1152 61696 09-MAY-12 db file scattered read 1 100 1152 61696 10-MAY-12 db file sequential read 1 6 1152 61696 10-MAY-12 db file scattered read 1 6 1152 61696 11-MAY-12 db file sequential read 1 7 1152 61696 11-MAY-12 db file scattered read 1 13 1152 61696 11-MAY-12 db file sequential read 1 13 1152 61696 15-MAY-12 db file scattered read 1 148 1152 62208 16-MAY-12 db file scattered read 1 8 1152 62464 16-MAY-12 db file sequential read 2 5 1152 62464 17-MAY-12 db file scattered read 1 17 1152 62208 17-MAY-12 db file sequential read 1 17 1152 62208 18-MAY-12 db file sequential read 2 17 1152 62208 18-MAY-12 db file scattered read 2 17 1152 62208 21-MAY-12 db file scattered read 1 16 1216 62208 21-MAY-12 db file scattered read 1 6 1216 62208 22-MAY-12 db file scattered read 1 19 1216 62208 23-MAY-12 db file scattered read 3 17 1216 62208 23-MAY-12 read by other session 1 15 1216 62208 24-MAY-12 db file scattered read 1 28 1216 61696 25-MAY-12 db file scattered read 1 67 1216 62208 25-MAY-12 db file sequential read 1 67 1216 62208 28-MAY-12 direct path read 3 45 1216 62464 28-MAY-12 direct path read 96 70 1216 62464 29-MAY-12 direct path read 10 484 1216 62464 29-MAY-12 direct path read 137 643 1216 62464 29-MAY-12 enq: KO - fast object checkpoint 1 545 1216 62464 35 rows selected. SQL>
The FULL TABLE SCAN as the outer rowsource of a NESTED LOOP is a performance threat anyway but in summary, a performance problem, irrelevant plan stability and a key switch to direct path IO, not obviously caused by a significant growth in object size.
If I’ve clearly understood the bottom line of this post you are saying that without any change in explain plan and without any significant change in an object segment size nor in the buffer cache space availability, a flip of a FULL segment access from a db file scattered read to a direct path read might be a cause of a performance pain.
Is it right?
If so, then the explication I have to the case you’ve posted here is that for each FULL segment operation decided by the OPTIMIZER (either FULL TABLE SCAN or INDEX FAST FULL SCAN) a db file sequential read is done on the segment header of the corresponding object to be full scanned. The purpose of this db file sequential read is to get the number of blocks to be read compared to the number of block below the high water mark. Based on this segment header db file sequential read, a decision (not a CBO decision) will be made of either to use a scattered read or a direct path read in order to FULL SCAN this segment.
However, when this segment header information is already in the buffer cache, there will be no db file sequential read to decide about the manner the segment will be FULL SCANNED. The existing information will be re-used for that purpose.
What do you think if for, subsequent select, this segment header information is not found in the buffer cache, forcing a new segment header db file sequential read, which at this time, will present different information about the number of block below the high water mark? Will this explain the change from scattered read to direct path read?
By the way, it’s strange to see in the list of event a db file sequential read for FULL TABLE ACCESS
Wasn’t this db file sequential read in your list, referring the db file segment header db file sequential read I am speaking about?
It comes just before the direct path read
Best Regards
Hi Mohamed,
Thanks for the comment.
Your summary of the bottom line is correct.
I think we have to be very careful about reading in too much here because this is DBA_HIST_ACTIVE_SESS_HISTORY, a sample of a sample AND aggregations of multiple executions.
An extended SQL Trace of both cases would have been much more informative.
So, I think this is something definitely worthy of further investigation in an isolated test case.
I appreciate that the information posted above is a little incomplete. I thought it was interesting so it’s a bit of a dump and run in the absence of any time to reproduce the issue.
One of the reasons that I didn’t look at this particular issue deeper at the time was that there were much bigger performance issues as a result of which it’s highly likely that significantly less of the table was cached when we switched to direct path.
So I believe that there is an adaptive element to the adaptive serial direct path read which takes into account how much of the object is cached which is along the lines of what you’re suggesting.
Cheers,
Dominic
Hi Dom,
I took something completely different from this (probably because I was doing my usual thing of scanning instead of really reading the details!). Your statement around the FTS being within a NESTED LOOP rarely being a good thing, no argument with that, but it got me thinking. If we know that certain combinations of operations in a plan are likely to be poor performers would it be interesting to write something that goes looking for these ‘possible issues’ proactively?
It could be a massive waste of time of course, and there are loads of other ways to find the ‘worst’ performing SQL etc, but just because a particular piece of SQL runs longer than others doesn’t necessarily mean that it’s less performant. Maybe it’s doing a hell of a lot more. So then that ‘Top SQL’ list isn’t quite as useful as it first appears.
Just a random thought. Maybe I’ll try it myself.
Cheers
Graham
Graham – I don’t think it would be ever be a waste of time, write once and use again and again. Doesn’t sound too tricky and it only needs to identify one statment once and it’s effort earned. You just need to identify a plan where the parent operation is a nested loop and the second child operation is some sort of full segment scan.
As you say, regarding top SQL just because there is statement at #1 doesn’t mean it’s a problem. As long as #1 is normally #1 it’s probably ok?
In the same way, just because a SQL statement isn’t in the top N (the statement in the article above was not), doesn’t mean it’s not a problem.
Yeah, I was thinking of almost an execution plan ‘anti-pattern’ search (for want of a better phrase). Not necessarily limited to just this one example, but probably throw in a few more that you’d want to be aware of. Cartesian joins seems like an obvious one.