April 25, 2007 Leave a comment
Today a problem with the application “hanging”.
The background is that there has been some development in an application against database A.
Database A is now out of action due to a disk controller failure of something akin to that and so the application is now pointing at database B.
When pointing at database A, the application was performing acceptably, not so when pointing at database B where it is “hanging”.
So, straightaway, in response to one of the questions we should ask ourselves when there is a sudden problem – what has changed? – we know that there has been a change in database and the immediate follow-up is what might the differences be between database A and database B.
Fortunately, one of the differences is already known to me and, given the symptoms, is jumping up and down, waving its hands and shouting ‘Look at me’.
Both database A and database B are relatively recent copies of the production database.
All 8.1.7 by the way.
However, the production database has a bad mix of some tables having statistics (some stale) and some having none which can lead the CBO to make some inappropriate decisions due to the difference between default values it uses for those objects without statistics and the data distribution reality.
Database A had had a statistics update subsequent to the refresh.
Database B had not.
Therefore it is highly likely that the symptom of the application hanging is a less than optimal plan on the SQL concerned.
For the application operation that is hanging, quite a few bits of SQL might be issued.
There might be more than one bit of SQL that is performing unacceptably but in any event, we need to work through each suboptimal performer as we find it.
The application mostly uses ORM to generate its lovely queries (although in this case the development that has occurred is to optimise some of the iterative, slow-by-slow dynamic SQL that is generated).
Therefore the best mechanism is to trace the database session concerned, replicate the “hanging” behaviour and see what it is doing. What I am expecting to find is that, on “hanging”, the last piece of SQL in the trace file will be doing more IO than it needs to.
So, a trigger is required which sets the extended trace event for a specific user:
CREATE OR REPLACE TRIGGER xxxxxx
AFTER LOGON ON DATABASE
IF USER = 'yyyyy'
EXECUTE IMMEDIATE 'ALTER SESSION SET events ''10046 trace name context forever, level 12''';
The next step is to reproduce the “hanging” behaviour via the application.
Sure enough this generates a trace file on the server which is doing many, many “db file sequential reads”.
More than normal? (where normal is the behaviour when it performs acceptably)
Don’t know for sure yet, but it’s a good bet.
The SQL in question is something like this:
FROM News t0
, (SELECT DISTINCT T1.newsid
FROM nolindex T2
, IMPORTERS T3
, NewsIndex T1
WHERE T3.destindexid = :1
AND T1.sectionid = T2.indexid
AND T2.indexid = T3.sourceindexid) v1
WHERE t0.ID = v1.newsid
The trace file is incomplete because, due to all the IO, it reached its size limit. Also the application was killed because we had the information we needed.
On database B (some stats but not all), running the SQL with a SET AUTOTRACE TRACE EXPLAIN reveals the following:
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=216293 Card=59764261584 Bytes=196445127826608)
1 0 MERGE JOIN (Cost=216293 Card=59764261584 Bytes=196445127826608)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'NEWS' (Cost=826 Card=1515434 Bytes=4961530916)
3 2 INDEX (FULL SCAN) OF 'PKNEWSTABLE' (UNIQUE) (Cost=26 Card=1515434)
4 1 SORT (JOIN) (Cost=50339 Card=3943706 Bytes=51268178)
5 4 VIEW (Cost=165128 Card=3943706 Bytes=51268178)
6 5 SORT (UNIQUE) (Cost=165128 Card=3943706 Bytes=256340890)
7 6 NESTED LOOPS (Cost=24 Card=3943706 Bytes=256340890)
8 7 MERGE JOIN (CARTESIAN) (Cost=24 Card=2589433 Bytes=134650516)
9 8 TABLE ACCESS (BY INDEX ROWID) OF 'IMPORTERS' (Cost=2 Card=11 Bytes=286)
10 9 INDEX (RANGE SCAN) OF 'DESTINDEXID_IDX' (NON-UNIQUE) (Cost=1 Card=11)
11 8 SORT (JOIN) (Cost=22 Card=235403 Bytes=6120478)
12 11 INDEX (FAST FULL SCAN) OF 'NEWSINDEXSECTLANGNEWSINDEX' (NON-UNIQUE) (Cost=2 Card=235403 Bytes=6120478)
13 7 INDEX (UNIQUE SCAN) OF 'NOLINDEX_PK' (UNIQUE)
There are some big numbers in there.
Next step is to run against database C which should be similar to database A before it died – i.e. refresh from production with statistics on more objects:
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=206 Card=89 Bytes=26700)
1 0 NESTED LOOPS (Cost=206 Card=89 Bytes=26700)
2 1 VIEW (Cost=28 Card=89 Bytes=1157)
3 2 SORT (UNIQUE) (Cost=28 Card=89 Bytes=2314)
4 3 NESTED LOOPS (Cost=26 Card=89 Bytes=2314)
5 4 NESTED LOOPS (Cost=2 Card=8 Bytes=120)
6 5 TABLE ACCESS (FULL) OF 'IMPORTERS' (Cost=2 Card=8 Bytes=80)
7 5 INDEX (UNIQUE SCAN) OF 'NOLINDEX_PK' (UNIQUE)
8 4 INDEX (RANGE SCAN) OF 'NEWSINDEXSECTLANGNEWSINDEX' (NON-UNIQUE) (Cost=3 Card=225099 Bytes=2476089)
9 1 TABLE ACCESS (BY INDEX ROWID) OF 'NEWS' (Cost=2 Card=4695495 Bytes=1347607065)
10 9 INDEX (UNIQUE SCAN) OF 'PKNEWSTABLE' (UNIQUE) (Cost=1 Card=4695495)
Some much smaller numbers and completely different access paths.
The numbers themselves are not an issue – it’s the different access paths that is the problem.
What I mean is that if, via hints for example, we forced the second access path on the poorly performing database, explain plan would still have much bigger numbers.
And so, for possible resolutions, we have a number of options, at least four, possibly more:
Option numero uno: Gather appropriate statistics on all objects so that the CBO has a decent chance.
Option numero the next one: Get rid of the partial statistics so that we have none and go back to using RBO.
Option three: Hint the SQL in the application.
Option four: Gather a stored outline for the query.
Options one and two can be discounted due to organisational “fear” of change. I would say that option one – all objects with appropriate statistics – is much preferable to option two. However, neither is a go-er due to fear that this might have a wider detrimental effect. (Incidentally, option one was attempted previously but the deployment was backed out because, according to rumour, the performance of something somewhere degraded)
Option three might be ok. The downside from an application perspective is that to put the hints in requires an application deployment, an application stop and start, and to take it out requires the same. As part of a wider strategy for this particular application, its suitability is questionable because it’s ORM generated code and therefore it’s probably not possible to control the hinting.
That leaves option four – capture a stored outline for the query on the “good” database, export, import to the “bad” database(s) and enable usage thereof. Might require some shared pool flushage but easily turn-on-and-offable, can be repeated easily for other problematic queries.
And just for completeness, if we were going for option four, the steps would be to first create the stored outline on the “good” database:
CREATE OR REPLACE OUTLINE <outline_name>
FOR CATEGORY <category_name>
(Note that the text of the sql statement must match exactly that that will be executed by the application.)
Next, export the outline from the “good database”, e.g.
exp userid=outln/ query="where category = ''<category_name>" tables=(ol$,ol$hints)
Then, import into the “bad” database, e.g.
imp userid=outln/ full=y ignore=yes
Finally, maybe use a trigger to enable the usage of the stored outline for the problem application user:
CREATE OR REPLACE TRIGGER xxxxxxxxx
AFTER LOGON ON DATABASE
IF USER = yyyyyyyyyyyyy
EXECUTE IMMEDIATE 'ALTER SESSION SET USE_STORED_OUTLINES = <category_name> ;
EXCEPTION WHEN OTHERS THEN NULL;