This is a follow-up to a database performance problem which happened a couple of weeks back, first briefly mentioned in this previous article of mine.
You’ve probably heard it a thousand times before and there’s no significant variance in this case nor revelation in resolution:
- The Business have reported a performance problem.
- Some reports which normally take a couple of minutes have been taking closer to one hour.
- Some DBAs / managers / developers know that when this problem has happened previously, if they flushed the shared pool, the problem mysteriously went away – so that’s what they’ve already done.
The first area of comment is the flushing of the shared pool.
Ok, so it’s a bit of a case of using a sledgehammer to crack a nut but the business have deadlines. With production performance problems and business deadlines, sometimes you’ve got no choice.
Then again, nobody knew it would work – fortunately, from some perspectives, it did work. But, until the root cause is identified and fixed, it’s always going to be uncertain. And “flush the shared pool” is always going to be the battle cry of the panicked, like “reboot your PC”.
Clearly, this is a system that can cope with a flushing of the shared pool. Not true of many.
So, the shared pool has been flushed and we’ve thrown the baby – the evidence & some of the causes – out with the bath water.
However, if you’re appropriately licensed on 10g onwards, you’ve got AWR and ASH to fall back on.
Quickly though, how long should your AWR retention be?
The default is 7 days.
But, as Doug points out in one of his comments, in some organisations by the time you’ve got access five or six days might have passed.
35 and 42 days are oft-mentioned sensible suggestions, mostly so that a) you’ve got a decent time buffer to react and b) you’ve got significant cover to allow for most business cycles of activity.
It’s set to 7 here which is a bit of pain.
Fortunately, I got access to the data within a couple of hours of requesting an account. But even so, I was working on other stuff at the same time and would occasionally think of things to double check but it was too late – the AWR and ASH data was long gone (also I’m writing this article about two weeks after I originally intended to and I wanted to post the queries and data from the actual problem but what with the disruption of the snow and Christmas, the data to match the queries has gone).
Anyway, you’ve been told something has been slow, the problem’s over now, the pool’s flushed, but people want an explanation – there’s an invASHtigation to be done.
In general, when I get access to production, I get data dictionary access. No access to underlying objects, no access to GUIs.
This is part of what I call a blind investigation.
I’m new to the system so I don’t know much about it, I’m not familiar with the SQL or objects, etc.
But I might have looked at some Statspack or AWR reports to get a general picture of activity.
DBA_HIST_ACTIVE_SESS_HISTORY is nearly always my starting place, or V$ACTIVE_SESSION_HISTORY if the data has not been flushed yet, to look for non-standard waiting.
First up though, you need to have asked some basic questions:
- Is/Was the problem system-wide or limited to a few sessions?
- If limited, limited to who? Connecting as which user? Connecting via which application?
- What was the timeframe of the problem
If the problem is system-wide, identifying it from looking at the ASH data for particular sessions might be tricky.
If the database is not the source of the problem, then using AWR and ASH might not give any clues.
It’s not terribly scientific, but paging quickly through and aggregating the ASH data can often give a good “feel” for whether the database was experiencing any non-standard extra time in any particular timeframe (although comparing system-level pictures of activity like statspack and AWR reports might be better).
In this case, I opened up with a query a bit like this, targetting a small timeframe when the problem was almost certainly to have been occurring for the end-user:
select to_char(sample_time,'HH24:MI:SS') time, session_id sid
, session_serial# s#, user_id usr, session_state state
, time_waited waited, blocking_session blk_by, blocking_session_status blk_stat
, sql_id, sql_child_number child_id, sql_plan_hash_value hash
, sql_plan_operation plan_op
, event, p1text, p1, p2text, p2, p3text p3, wait_class
, top_level_sql_id top_id, sql_exec_start ex_start
, plsql_entry_object_id pl_obj, plsql_entry_subprogram_id pl_prog
where sample_time between <start_time> and <end_time>
order by time desc;
Then adding a predicate to run a similar query targetted on the particular user/s experiencing the problem:
and user_id = <user_id>
Followed by zooming in on a particular session if one stands out in the data above:
and session_id = <session_id>
and session_serial# = <serial#>
With this particular performance problem, it was quickly obvious that this was a case of good plan gone bad with the affected sessions not blocked or waiting but mostly ON CPU running the same SQL_ID and the same SQL_PLAN_HASH_VALUE.
By changing the query to look at counts of a particular hash, it might be possible to find if it’s a particular plan hash that is causing a problem and when it came along:
select sql_plan_hash_value, COUNT(*), MIN(sql_exec_start), MAX(sql_exec_start)
where sql_id = <sql_id>
group by sql_plan_hash_value;
This query revealed a big skew towards a particular plan which had first appeared in the sampled data at 6 in the morning that day.
Another view which may reinforce your thinking here is DBA_HIST_SQL_STAT. For example, you can get a rough idea of the relative work of one plan over another using a query like this:
select plan_hash_value, optimizer_cost, optimizer_mode, optimizer_env_hash_value, buffer_gets_delta/executions_delta
where sql_id = <sql_id>
and executions_delta > 0;
The plans were gone out of memory but there’s a handy function to pull the plan from AWR:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR(<sql_id>,<plan_hash_value>,<db_id>,<format>));
Which works really well as long as you don’t hit the display issues you can get with large plans.
If you do hit these display issues then you might prefer to pull your plans from dba_hist_sql_plan for example using a variation of Tanel’s script here.
lpad(' ',depth*1,' ')||operation || ' ' || options xms_plan_step,
plan_hash_value = <plan_hash>
order by id
One big, big downside of pulling the plan from AWR and not from v$sql_plan is that the access_predicates and filter_predicates columns are not populated. The columns are there but the information has not been copied across. A big oversight in the implementation surely.
Having got a bad plan and some not-so-bad plans (same sql id, different plan hash values), then we can start to look at the differences and what might have caused them.
I’m having trouble with uploading and/or presenting these plans in a way that works. WordPress won’t let me upload text files, the plans are 3000-odd lines long and the various options that I’ve tried so far do not give a satisfactory result.
Actual plans aside, suffice to say that there are significant differences between good and bad plans with the bad plans generally employing full table scans against a big table rather than index lookups but there can be an awful lot of distractions in a big plan.
So … so far… good plan, bad plan, same sql, obvious differences. Why? Where else can we look?
There’s one more place which might help – DBA_HIST_SQLBIND.
This should have captured the peeked binds used when our bad plan was parsed along with any binds captured at regular intervals. Unfortunately, we can only tie in with the SQL_ID as there’s no sql plan hash value here so we’re really looking at LAST_CAPTURED in conjunction with the dates/times from the other views.
So. In conclusion…
The details are a bit vague and I apologise for this. I took too long before starting this article and ended up with a good illustration of why 7 days is too short for your AWR retention. However, AWR & ASH can be a very good way of identifying the causes of plan changes.
In this case, it seems highly likely that a particular bit of sql got a bad plan.
It might not have been bad for the particular combination of binds which were peeked at at parse time, but at least bad for the majority of uses that day.
There is, sadly, still just an element of theory about the causes of this particular performance problem.
I can see the bad plan. I can see the plans which work better in that the performance is acceptable for everyone (better does not necessarily mean good). I can see the binds which produced the better plans. I can see the binds which produced the band plan. But, as yet, running the query with the binds which produced the bad plan is not producing the bad plan :(.
And that’s the outstanding problem – if you can’t reproduce something at will, how do you know if you’ve fixed something? Can you prove you’ve understood the whole problem? You can’t know. Not for sure.
I hope to figure out the best way to upload these plans in another article.
This particular problem is almost certainly a case of bind variable peeking – arguably where bind variables shouldn’t be used. On a reporting system where the reports typically take some minutes, the overhead of hard parsing shouldn’t really be a consideration – this seems to be a system where the mantra “you must use bind variables” has been applied inappropriately. The reports take in date parameters which might be historic date ranges or might just be the latest month – a case of skew, a case of one plan does not suit all, a case for literals.
Any ideas, thoughts, mistakes spotted, how to post big plans, let me know.
Some other ASH / AWR resources:
Tracking the bind variable – Dion Cho
Doug’s AWR tagged posts
Doug’s ASH tagged posts
Tracking Parameter Changes – Kerry Osborne