Doing more for less…

January 26, 2010

… seems to mean doing less.

Over recent years, economic troubles, and focus on the bottom line, I’ve heard wide mention of the phrase “Doing More for Less.”

Everywhere I look where people claim to be “doing more for less”, all I see is doing less.

(Although making sure you look busy while doing less seems to be the most important “quality”).

Has there ever been a time where so many people have been so focused on the short term, today’s budget, what just needs to be done “for this requirement only”?

Or so little care about whether it will be still standing tomorrow, whether it’s good value, or the right way to do something?

It’s a checkbox world.

Box ticked? Check! Move on to the next one!

In my day-to-day IT role, I seem to have regressed recently to only having one line changes to make.

One line changes which take ages to test.

Is it the cheapest change to make for the requirement?

Check!

Are we slowly making so many cheap, tactical one line changes such that system just becomes too convoluted, the cost of testing changes vastly outweighs the cost of making the code change and it starts to look cheaper to start again than to make any more changes?

Check!

But I’m not just talking about IT.

After the recent cold snap in the UK, there have been potholes in the roads everywhere.

Last week a couple of the biggest, most alloy-threatening down my lane had suddenly been filled in.

Had they been filled in properly?

Of course not.

Was it a proper repair, properly sealed that stood a chance of lasting another bout of the cold stuff?

Of course not.

Had it just been filled and compacted in the cheapest way to meet the immediate requirement? Hole filled?

Check!

But I can push the edges of the repair with my finger and it starts to crumble!

At the gym this week, I was chatting to one of the instructors.

The local place had a reputation of having one of the best line-ups of instructors in Europe requiring several years of experience, sports science degrees, etc.

She was bemoaning the cost-cutting which had resulted in several very experience and motivating instructors leaving and the general trend of moving away from instructors with significant years of experience and sports science education towards using far less experienced instructors working to prescripted classes detailing the precise times to shout specific motivating phrases.

You could go to any of these guys and their wealth of broad knowledge would be evident when telling you what was probably causing a specific problem or suggesting a change to a specific routine, etc.

Same number of classes?

Check! (but shorter)

Short-termist?

Of course.

This years budget reduced?

Check.

Are punters leaving because of the poorer service?

Check!

Politics has always been notoriously short-termist. But in the UK so many of the actions are targeted towards just winning the next, imminent election, not doing what’s best for the deficit, the long-term future of the country, etc.

And just look at the state of people’s pensions with the actions on dividend tax relief when Labour first came to power, etc.

Tax raised?

Check!

Long-term financial security and state independence undermined?

Who cares.

I don’t like talking about politics. But what about the deployments of underequipped armed forces? Defence budgets cut? Check! Armies deployed? Check!

Look at the big banks. Saved by taxpayers billions? Check! Now inappropriately making billions of short-term profits based on QE and paying huge bonuses? Check! Move along! This isn’t the contrition and long term thinking that you were looking for.

I used to think that so many of the things which annoyed me were IT specific and the result of, for example, poor interpretations/implementations of “Agile.” (Mind you before that I was incredibly frustrated by the blind adherence to Waterfall – Analysis done? Check!)

But I see so many tangents in so many unrelated areas.

And everywhere I look (not just IT) there seems to be an abundance of managers (Managing? Check!) and a void of leaders.

It’s a funny old world.

Where are the last bastions of doing things properly?


Problem plans

January 15, 2010

I was looking at a performance problem a week or so before Christmas which I touched on in previous posts recently.

These are big plans and I had trouble uploading them in a decent format for this blog.

The format isn’t great but I’ve finally put them up on on one of these pastebin things for now:

As to how long they remain available, we’ll see.

These are probably of limited interest to anyone.

But they do illustrate a couple of things.

1. There are a lot of distractions in a big plan.

These are big plans at around 2800 lines.

You can see a lot of the deadly duo – MERGE JOIN CARTESIAN and BUFFER SORT operations- but these themselves are not the cause of the performance degradation (see point 2 about relative goodness).

It’s not until you get to lines six to seven hundred that you start to see any symptom of the performance difference.

The main performance problems stem from the full table scans of POSITIONS (e.g. See Id 717 in Bad Plan) and the HASH JOIN (Id 605) rather than the INDEX ACCESS (Id 719 in Good plan) and NESTED LOOP (605/6) operations, operations which are repeated throughout the rest of the plan

2. Good can be relative

I don’t think anyone would look at the “good” plan and be comfortable with it.

Then again could you could ever look at that sort of length of plan and think “Yes!” ?

The point is that the business were happy with the report performance that this plan provided and in the current business environment there’s rarely appetite to change things just because they’re not good technically.

3. It’s difficult to hint adequately

There are a handful of hints in the source code which are presumably intended to offer some sort of plan stability.

However, with a plan of this size, a couple of USE_NL hints are not going to be adequate.

As Jonathan Lewis points out, it’s very difficult to hint well/completely.

There are other ways to achieve plan stability if that’s what you want.

In this case, those handful of hints have probably contributed as much to plan instability as anything else.

4. Reusability is not always a good thing

These plans come from a reporting system which is doing too much calculation dynamically.

There is a core of Views on Views on Pipelined Table Functions on top of more Views which aims to provide a reusable interface to all the information being reported on and the various reports on top.

It’s not unusual to get performance problems from that sort of setup and probably one of the reasons that a handful of hints were used in the first place.

I would nearly always favour performance over reusability.

P.S. These plans also show the wrapping issue mentioned previously. At the time of writing, I still have an open Oracle SR for that. They had asked for full dumps of tables, statistics, etc. However I reproduced it with a standalone query joining dba_objects 50 times.


Up and Down Like an Up and Downy Thing

January 12, 2010

I am totally fed up with the poor performance and lack of availability of the Oracle Documentation.

It’s been like this for weeks.

Never a good advert for a technology company.

The upgrade and subsequent performance of Metalink/Oracle Support got a lot of press but this hasn’t.

Maybe that’s due to the vast array of good alternatives like Morgan’s library, Oracle-base, etc.

But damn it bugs me.


Dynamic Sampling and Partitioning

January 7, 2010

A question on OTN asked about the interaction between dynamic sampling and partitioning, what difference does partitioning make to level 2 dynamic sampling?

To recap on dynamic sampling levels, level 2 says that dynamic sampling should be applied to all unanalysed tables with the number of sampled blocks being twice the default (32).

Asif Momen and Randolf Geist have two interesting studies showing how different variations on table, partition and subpartition statistics affect whether dynamic sampling is used at all – i.e. partitioning introduces a qualification/variation to that definition above of all unanalysed tables and affects whether dynamic sampling is eligible.

The other element to the question is, the above aside, once we’re in a situation where dynamic sampling is valid and eligible, is there any effect of partitioning on dynamic sampling? I.e. does dynamic sampling do anything special for a partitioned table?

The question is answered by looking at a 10046 trace which reveals how dynamic sampling does it’s stuff.

Taking, if he’ll excuse me, the table from Randolf’s post and the first query:


  1  CREATE TABLE wr_test
  2  ( test_id
  3  , trade_date
  4  , CONSTRAINT test_pk PRIMARY KEY (trade_date, test_id) using index local)
  5  PARTITION BY RANGE (trade_date)
  6  ( PARTITION p_jan VALUES LESS THAN (DATE '2009-02-01')
  7  , PARTITION p_feb VALUES LESS THAN (DATE '2009-03-01')
  8  , PARTITION p_mar VALUES LESS THAN (DATE '2009-04-01') )
  9  AS
 10  SELECT ROWNUM AS test_id
 11  , DATE '2009-02-01' + trunc(dbms_random.value(0, 59)) as trade_date
 12  FROM dual
 13* CONNECT BY LEVEL  <=1000
/

Table created.

SQL>alter session set events '10046 trace name context forever, level 12';

Session altered.

SQL> explain plan for
  2  select * from wr_test where trade_date = date '2009-03-01';

Explained.

Then looking at the trace file (handy hint):

SQL> l
  1  select value ||'/'||(select instance_name from v$instance) ||'_ora_'||
  2	    (select spid||case when traceid is not null then '_'||traceid else null end
  3	     from   v$process
  4	     where  addr = (select paddr
  5			    from   v$session
  6			    where  sid = (select sid
  7					  from	 v$mystat
  8					  where  rownum = 1))) || '.trc' tracefile
  9  from  v$parameter
 10* where name = 'user_dump_dest'
SQL> /

TRACEFILE
--------------------------------------------------------------------------------
/opt/oracle/admin/DOM10gR2/udump/DOM10gR2_ora_428.trc

SQL> 

If we run the trace file through TKPROF for readability then we can find the following queries from the dynamic sampling:

SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE
  NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false')
  NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),0), NVL(SUM(C2),0)
FROM
 (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("WR_TEST") FULL("WR_TEST")
  NO_PARALLEL_INDEX("WR_TEST") */ 1 AS C1, CASE WHEN "WR_TEST"."TRADE_DATE"=
  TO_DATE(' 2009-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') THEN 1 ELSE 0 END
  AS C2 FROM "WR_TEST" "WR_TEST" WHERE "WR_TEST"."TRADE_DATE"=TO_DATE('
  2009-03-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) SAMPLESUB

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.00       0.00          0          4          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.00       0.00          0          4          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: SYS   (recursive depth: 1)

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=4 pr=0 pw=0 time=503 us)
    542   PARTITION RANGE SINGLE PARTITION: 3 3 (cr=4 pr=0 pw=0 time=610 us)
    542    TABLE ACCESS FULL WR_TEST PARTITION: 3 3 (cr=4 pr=0 pw=0 time=64 us)

********************************************************************************

SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS opt_param('parallel_execution_enabled',
   'false') NO_PARALLEL(SAMPLESUB) NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE
  */ NVL(SUM(C1),0), NVL(SUM(C2),0), NVL(SUM(C3),0)
FROM
 (SELECT /*+ NO_PARALLEL("WR_TEST") INDEX("WR_TEST" TEST_PK)
  NO_PARALLEL_INDEX("WR_TEST") */ 1 AS C1, 1 AS C2, 1 AS C3  FROM "WR_TEST"
  "WR_TEST" WHERE "WR_TEST"."TRADE_DATE"=TO_DATE(' 2009-03-01 00:00:00',
  'syyyy-mm-dd hh24:mi:ss') AND ROWNUM &lt;= 2500) SAMPLESUB

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.00          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch        2      0.00       0.01          2          4          0           2
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        6      0.00       0.02          2          4          0           2

Misses in library cache during parse: 2
Optimizer mode: ALL_ROWS
Parsing user id: SYS   (recursive depth: 1)

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=2 pr=1 pw=0 time=15881 us)
     24   VIEW  (cr=2 pr=1 pw=0 time=15878 us)
     24    COUNT STOPKEY (cr=2 pr=1 pw=0 time=15874 us)
     24     PARTITION RANGE SINGLE PARTITION: 3 3 (cr=2 pr=1 pw=0 time=15849 us)
     24      INDEX RANGE SCAN TEST_PK PARTITION: 3 3 (cr=2 pr=1 pw=0 time=15821 us)(object id 51602)

and you can repeat the steps with an unpartitioned table, e.g.

SQL> ed
Wrote file afiedt.buf

  1  CREATE TABLE wr_test
  2  ( test_id
  3  , trade_date
  4  , CONSTRAINT test_pk PRIMARY KEY (trade_date, test_id) using index)
  5  AS
  6  SELECT ROWNUM AS test_id
  7  , DATE '2009-02-01' + trunc(dbms_random.value(0, 59)) as trade_date
  8  FROM dual
  9* connect by level  <= 1000
/

Table created.

The point being that the queries are the same for both the partitioned and unpartitioned table – there’s no special treatment – although the access plans for these recursive queries are going to be different due to the partitioning.

As an aside, what prevents us from entering an infinite loop of dynamic sampling of the dynamic sampling queries?


SQL_PLAN – other_xml

December 31, 2009

Just a quick follow-up to my previous post on investigating bad plans using AWR & ASH.

In that article, I mentioned

  1. that it was a shame that the ACCESS_PREDICATES and FILTER_PREDICATES columns were not populated (how this cannot be classified as a bug is beyond comprehension) and
  2. that I was using DBA_HIST_SQLBIND to investigate various peeked binds for that particular SQL based on the LAST_CAPTURED column

However, I should also have mentioned that, from 10gR2 onwards, populated in both V$SQL_PLAN and DBA_HIST_SQL_PLAN, there is a very useful column called OTHER_XML.

From the documentation:

OTHER_XML: “Provides extra information specific to an execution step of the execution plan. The content of this column is structured using XML because it allows multiple pieces of information to be stored, including the following:

  • Name of the schema against which the query was parsed.
  • Release number of the Oracle Database that produced the explain plan.
  • Hash value associated with the execution plan.
  • Name (if any) of the outline or the SQL profile used to build the execution plan.
  • Indication of whether or not dynamic sampling was used to produce the plan.
  • The outline data, a set of optimizer hints that can be used to regenerate the same plan.

Here is a sample of the peeked bind section from my sql with the problem plan:


  <peeked_binds>
    <bind nam=”:B3″ pos=”3″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″>c3023102</bind>
    <bind nam=”:B2″ pos=”4″ dty=”1″ csi=”873″ frm=”1″ mxl=”32″>524d46</bind>
    <bind nam=”:B1″ pos=”5″ dty=”12″ mxl=”7″>786d0a1f010101</bind>
    <bind nam=”:B3″ pos=”13″ ppo=”3″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″/>
    <bind nam=”:B2″ pos=”14″ ppo=”4″ dty=”1″ csi=”873″ frm=”1″ mxl=”32″/>
    <bind nam=”:B1″ pos=”15″ ppo=”5″ dty=”12″ mxl=”7″/>
    <bind nam=”:B3″ pos=”16″ ppo=”3″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″/>
    <bind nam=”:B1″ pos=”19″ ppo=”5″ dty=”12″ mxl=”7″/>
    <bind nam=”:B3″ pos=”20″ ppo=”3″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″/>
    <bind nam=”:B5″ pos=”21″ ppo=”1″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″/>
    <bind nam=”:B3″ pos=”22″ ppo=”3″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″/>
    <bind nam=”:B6″ pos=”23″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″>c10d</bind>
    <bind nam=”:B1″ pos=”24″ ppo=”5″ dty=”12″ mxl=”7″/>
    <bind nam=”:B6″ pos=”26″ ppo=”23″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″/>
    <bind nam=”:B1″ pos=”27″ ppo=”5″ dty=”12″ mxl=”7″/>
    <bind nam=”:B1″ pos=”29″ ppo=”5″ dty=”12″ mxl=”7″/>
    <bind nam=”:B3″ pos=”31″ ppo=”3″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″/>
    <bind nam=”:B6″ pos=”32″ ppo=”23″ dty=”2″ pre=”0″ scl=”0″ mxl=”22″/>
    <bind nam=”:B1″ pos=”33″ ppo=”5″ dty=”12″ mxl=”7″/>
  </peeked_binds>

I’m not sure if the repetition of bind names and positions but not values is normal – one for further investigation.

But bind variable 1 is what I was talking about previously as my date bind with the data skew and we can convert the values back out of their raw value using, for example, the overloaded DBMS_STATS.CONVERT_RAW_VALUE.

So, from above, taking my B1 bind of “786d0a1f010101″:

declare
l_date date;
begin
dbms_stats.convert_raw_value(rawval => '786d0a1f010101', resval => l_date);
dbms_output.put_line('dt: '||to_char(l_date,'DD/MM/YYYY HH24:MI:SS'));
end;
/

dt: 31/10/2009 00:00:00

Some other posts regarding OTHER_XML:
Kerry Osborne using OTHER_XML to create test scripts with bind variables
Randolf Geist using OTHER_XML to get outlines / create profiles


Plans gone AWRy

December 29, 2009

This is a follow-up to a database performance problem which happened a couple of weeks back, first briefly mentioned in this previous article of mine.

You’ve probably heard it a thousand times before and there’s no significant variance in this case nor revelation in resolution:

  • The Business have reported a performance problem.
  • Some reports which normally take a couple of minutes have been taking closer to one hour.
  • Some DBAs / managers / developers know that when this problem has happened previously, if they flushed the shared pool, the problem mysteriously went away – so that’s what they’ve already done.

The first area of comment is the flushing of the shared pool.

Ok, so it’s a bit of a case of using a sledgehammer to crack a nut but the business have deadlines. With production performance problems and business deadlines, sometimes you’ve got no choice.

Then again, nobody knew it would work – fortunately, from some perspectives, it did work. But, until the root cause is identified and fixed, it’s always going to be uncertain. And “flush the shared pool” is always going to be the battle cry of the panicked, like “reboot your PC”.

Clearly, this is a system that can cope with a flushing of the shared pool. Not true of many.

So, the shared pool has been flushed and we’ve thrown the baby – the evidence & some of the causes – out with the bath water.

However, if you’re appropriately licensed on 10g onwards, you’ve got AWR and ASH to fall back on.

Quickly though, how long should your AWR retention be?

The default is 7 days.
But, as Doug points out in one of his comments, in some organisations by the time you’ve got access five or six days might have passed.

35 and 42 days are oft-mentioned sensible suggestions, mostly so that a) you’ve got a decent time buffer to react and b) you’ve got significant cover to allow for most business cycles of activity.

It’s set to 7 here which is a bit of pain.
Fortunately, I got access to the data within a couple of hours of requesting an account. But even so, I was working on other stuff at the same time and would occasionally think of things to double check but it was too late – the AWR and ASH data was long gone (also I’m writing this article about two weeks after I originally intended to and I wanted to post the queries and data from the actual problem but what with the disruption of the snow and Christmas, the data to match the queries has gone).

Anyway, you’ve been told something has been slow, the problem’s over now, the pool’s flushed, but people want an explanation – there’s an invASHtigation to be done.

In general, when I get access to production, I get data dictionary access. No access to underlying objects, no access to GUIs.

This is part of what I call a blind investigation.
I’m new to the system so I don’t know much about it, I’m not familiar with the SQL or objects, etc.
But I might have looked at some Statspack or AWR reports to get a general picture of activity.

DBA_HIST_ACTIVE_SESS_HISTORY is nearly always my starting place, or V$ACTIVE_SESSION_HISTORY if the data has not been flushed yet, to look for non-standard waiting.

First up though, you need to have asked some basic questions:

  • Is/Was the problem system-wide or limited to a few sessions?
  • If limited, limited to who? Connecting as which user? Connecting via which application?
  • What was the timeframe of the problem

If the problem is system-wide, identifying it from looking at the ASH data for particular sessions might be tricky.

If the database is not the source of the problem, then using AWR and ASH might not give any clues.

It’s not terribly scientific, but paging quickly through and aggregating the ASH data can often give a good “feel” for whether the database was experiencing any non-standard extra time in any particular timeframe (although comparing system-level pictures of activity like statspack and AWR reports might be better).

In this case, I opened up with a query a bit like this, targetting a small timeframe when the problem was almost certainly to have been occurring for the end-user:

select to_char(sample_time,'HH24:MI:SS') time, session_id sid
, session_serial# s#, user_id usr, session_state state
, time_waited waited, blocking_session blk_by, blocking_session_status blk_stat
, sql_id, sql_child_number child_id, sql_plan_hash_value hash
, sql_plan_operation plan_op
, event, p1text, p1, p2text, p2, p3text p3, wait_class
, top_level_sql_id top_id, sql_exec_start ex_start
, plsql_entry_object_id pl_obj, plsql_entry_subprogram_id pl_prog
from   dba_hist_active_sess_history
where sample_time between <start_time> and <end_time>
order by time desc;

Then adding a predicate to run a similar query targetted on the particular user/s experiencing the problem:

and user_id = <user_id>

Followed by zooming in on a particular session if one stands out in the data above:

and session_id = <session_id>
and  session_serial# = <serial#>

With this particular performance problem, it was quickly obvious that this was a case of good plan gone bad with the affected sessions not blocked or waiting but mostly ON CPU running the same SQL_ID and the same SQL_PLAN_HASH_VALUE.

By changing the query to look at counts of a particular hash, it might be possible to find if it’s a particular plan hash that is causing a problem and when it came along:

select sql_plan_hash_value, COUNT(*), MIN(sql_exec_start), MAX(sql_exec_start)
from   dba_hist_active_sess_history
where  sql_id = <sql_id>
group by sql_plan_hash_value;

This query revealed a big skew towards a particular plan which had first appeared in the sampled data at 6 in the morning that day.

Another view which may reinforce your thinking here is DBA_HIST_SQL_STAT. For example, you can get a rough idea of the relative work of one plan over another using a query like this:

select plan_hash_value, optimizer_cost, optimizer_mode, optimizer_env_hash_value, buffer_gets_delta/executions_delta
from   dba_hist_sqlstat
where sql_id = <sql_id>
and   executions_delta > 0;

The plans were gone out of memory but there’s a handy function to pull the plan from AWR:

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR(<sql_id>,<plan_hash_value>,<db_id>,<format>));

Which works really well as long as you don’t hit the display issues you can get with large plans.

If you do hit these display issues then you might prefer to pull your plans from dba_hist_sql_plan for example using a variation of Tanel’s script here.

select
    id      xms_id,
    lpad(' ',depth*1,' ')||operation || ' ' || options xms_plan_step,
    object_name     xms_object_name,
    cost        xms_opt_cost,
    cardinality xms_opt_card,
    bytes       xms_opt_bytes,
    optimizer   xms_optimizer
from
    dba_hist_sql_plan
where
    plan_hash_value = <plan_hash>
order by id

One big, big downside of pulling the plan from AWR and not from v$sql_plan is that the access_predicates and filter_predicates columns are not populated. The columns are there but the information has not been copied across. A big oversight in the implementation surely.

Having got a bad plan and some not-so-bad plans (same sql id, different plan hash values), then we can start to look at the differences and what might have caused them.

I’m having trouble with uploading and/or presenting these plans in a way that works. WordPress won’t let me upload text files, the plans are 3000-odd lines long and the various options that I’ve tried so far do not give a satisfactory result.

Actual plans aside, suffice to say that there are significant differences between good and bad plans with the bad plans generally employing full table scans against a big table rather than index lookups but there can be an awful lot of distractions in a big plan.

So … so far… good plan, bad plan, same sql, obvious differences. Why? Where else can we look?

There’s one more place which might help – DBA_HIST_SQLBIND.
This should have captured the peeked binds used when our bad plan was parsed along with any binds captured at regular intervals. Unfortunately, we can only tie in with the SQL_ID as there’s no sql plan hash value here so we’re really looking at LAST_CAPTURED in conjunction with the dates/times from the other views.

So. In conclusion…

The details are a bit vague and I apologise for this. I took too long before starting this article and ended up with a good illustration of why 7 days is too short for your AWR retention. However, AWR & ASH can be a very good way of identifying the causes of plan changes.

In this case, it seems highly likely that a particular bit of sql got a bad plan.
It might not have been bad for the particular combination of binds which were peeked at at parse time, but at least bad for the majority of uses that day.

There is, sadly, still just an element of theory about the causes of this particular performance problem.
I can see the bad plan. I can see the plans which work better in that the performance is acceptable for everyone (better does not necessarily mean good). I can see the binds which produced the better plans. I can see the binds which produced the band plan. But, as yet, running the query with the binds which produced the bad plan is not producing the bad plan :( .

And that’s the outstanding problem – if you can’t reproduce something at will, how do you know if you’ve fixed something? Can you prove you’ve understood the whole problem? You can’t know. Not for sure.

I hope to figure out the best way to upload these plans in another article.

This particular problem is almost certainly a case of bind variable peeking – arguably where bind variables shouldn’t be used. On a reporting system where the reports typically take some minutes, the overhead of hard parsing shouldn’t really be a consideration – this seems to be a system where the mantra “you must use bind variables” has been applied inappropriately. The reports take in date parameters which might be historic date ranges or might just be the latest month – a case of skew, a case of one plan does not suit all, a case for literals.

Any ideas, thoughts, mistakes spotted, how to post big plans, let me know.

Some other ASH / AWR resources:
Tracking the bind variable – Dion Cho
Doug’s AWR tagged posts
Doug’s ASH tagged posts
Tracking Parameter Changes – Kerry Osborne


DBMS_XPLAN display issues

December 21, 2009

DBMS_XPLAN is a great little tool for getting formatted SQL execution plans and associated information. Good enough to make my revised top 3 yesterday.

Whether following an explain plan:

EXPLAIN PLAN FOR
SELECT ... FORM ...;

SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY);

Or pulling a plan from memory using:

SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(<sql_id>,<child_cursor>,<format>));

Typically used with the gather_plan_statistics hint like this:

SELECT /*+ gather_plan_statistics */ ....
FROM ...
WHERE...;

SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

Or pulling a plan still in memory by specifying the sql_id and, optionally, the child cursor.

For further information on the options for the third format parameter, here is a nice post on Rob van Wijk’s blog.

Jonathan Lewis also has some articles on dbms_xplan here and here.

Another handy option which I’ve been using recently is the ability to pull plans which are no longer in memory from the AWR repository using:

SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_AWR(<sql_id>,<child_cursor>,<db_id>,<format>));

Whilst troubleshooting a recent performance problem where a good (all things are relative) plan had gone bad, the bad plan was no longer in memory. However, I was able to pull the bad plan from the AWR repository and also a number of variations on the good plan.

However, this was a big plan – some 3000 lines – and illustrates some display issues which you can get.

See this excerpt:


| 316 | E JOIN CARTESIAN                                        MERG |                                |    790 |
| 317 | IST ITERATOR                                             INL |                                |        |
| 318 | BLE ACCESS BY INDEX ROWID                                 TA | CLASSIFICATIONS                |      5 |
| 319 | NDEX RANGE SCAN                                            I | CLASSIFICATIONS_IX1            |      5 |
| 320 | FER SORT                                                 BUF |                                |    158 |
| 321 | RT UNIQUE                                                 SO |                                |    158 |
| 322 | NDEX FAST FULL SCAN                                        I | PK_ELEMENT_AGGREGATES          |    158 |
| 323 | E ACCESS FULL                                           TABL | PERFORMANCE_ELEMENTS           |   1863 |
| 324 |  UNIQUE SCAN                                           INDEX | PK_CLASSIFICATION_ELEMENTS     |      1 |
| 325 |  SCAN                                            INDEX RANGE | VEHICLE_METRICS_PK             |      1 |
| 326 | S BY INDEX ROWID                                 TABLE ACCES | POSITIONS                      |      1 |
| 327 | E SCAN                                            INDEX RANG | POSITIONS_UK                   |      1 |
| 328 |  INDEX ROWID                                 TABLE ACCESS BY | PRR_MAPPINGS                   |      1 |
| 329 | AN                                            INDEX RANGE SC | PRR_MAPPINGS_UX1               |      1 | 

The OUTPUT column is wrapped and does not make such investigations and interpretations any easier.
Note that this is NOT a SQL*Plus display issue as far as I can work out. It’s not the whole line which is wrapping.

As far as I know, there is no solution to this.
I have an Oracle SR open for confirmation but after 4 days I’ve yet to even receive confirmation of my ticket…

If there’s a solution, let me know.


Top 3 Oracle Features of the Decade

December 19, 2009

Yesterday I made an early break for the annual New Year resolutions post.

However, there’s another end-of-year post that’s up for consideration in this particular year and, rather than looking forward to the coming year, that’s a look back over the past 10, 2000-2010.

So, I ask you what are your top 3 Oracle RDBMS features over the past decade.

Oracle 8i was 1999 so we’re really talking about:

  1. 9iR1 in 2001
  2. 9iR2 in 2002
  3. 10gR1 in 2003
  4. 10gR2 in 2005
  5. 11gR1 in 2007
  6. 11gR2 in 2009

A Top 3 is very, very challenging when you think of some of the enhancements which have come along.

Obviously it depends on your own, personal perspective – naturally, mine are going to have a Development bias.

Each version fixes a significant number of bugs from the previous release, but I seem to remember 8i being a rush job for the buzzwords ahead of the new millenium (the early patches at least), 9iR1 a hatchet job and 9iR2 being a relative bastion of stability.

Then again it’s also a long time ago now and my memories may be unreliable.

The New Features Guides of 9i (links above) lists amongst others:

  • LogMiner improvements, DataGuard, RAC, Flashback Query, Some online redefinition & reorganisation features, VPD enhancements, Automatic Undo Management, dynamic memory management, spfles, RMAN improvements, native XML functionality and XMLDB, the MERGE statement, TIMESTAMPs and INTERVALs, CASE statement, External Tables, associative arrays/index-by tables indexed by VARCHAR2, Streams, CDC, Index Skip Scans, ANSI SQL, OMF, multiple blocksizes, dynamic sampling, table compression, subquery factoring, pipelined table functions, etc.

A list of new features from 10g (or my list at least) is shorter, but even so it seems a bigger hitter in terms of weighty marketing-savvy acronyms and features:

  • ADDM, ASH, ASM, AWR, Automatic SQL Tuning, DataPump, Job Scheduler, SQL Access Advisor, HTMLDB, Online Table Redefinition, Oracle XE, DBMS_XPLAN…

11g has always felt like 10gR3. Nevertheless, there have been still some impressive features therein:

  • Adaptive Cursor Sharing, Result Cache, Database Resident Connection Pool Caching, Invisible Indexes, SecureFiles, Binary XML…

And then recently Exadata and the Oracle Database Machine (although I think it’s too soon to make any judgement on these and if they have a big impact then it will be mostly in the next decade).

For the summaries above, I’ve missed out a lot (probably accidentally – let me know – but I was rushing towards the end).

But I’ve focused on the initial introductions of headline new features.

However, I would argue that the gradual evolution of some of the main features and functionality has had some of the biggest impact as we are now at the end of the decade compared to the end of 1999. In addition, some of the tweaks and internals have also had a big impact whilst avoiding the headlines – mutexes for example.

Think about how the CBO has moved on over the past ten years.

Or how HTMLDB has morphed and moved on to APEX.

Or how parallel capabilities have developed. Or partitoning.

Maybe you like your GUIs like OEM.

Or the evolution of IDEs has been revolutionary for you e.g. SQL Developer, or the explosion of functionality in third party tools like Toad or Quest Code Tester for Oracle.

Or the combination of several features – external tables plus pipelined functions are pretty cool for loading in data, for example. Or partitioning plus parallel.

All of these are up for consideration.

What’s not up for consideration is stuff that predates the decade. There are several features which I thought of while doing this and which turned out to be introduced in 8i or even before. Man, tempus fugit…

So, my Top 3 is based on what I do or use in Oracle most days (or even what I don’t have to do anymore) because of features introduced or evolved over the past decade. The evolution of collection functionality in SQL and PLSQL came close. A bit further behind was some of the XML capabilities. A whole host of ineligible features from older versions were scrubbed out. However, in no particular order, my vote goes to:

  • Analytics – It’s amazing how many time I end up using analytic functions. As Tom Kyte says “Analytics rock, Analytics roll”. Countless multipasses of data have been avoided with these babies.(Analytics ruled ineligible as an 8i introduction, dammit)
  • The WITH clause aka Subquery Factoring – Most of my sql, if it has any sort of complexity, ends up using this feature; fantastic for breaking up and understanding a complex bit of sql that you’ve never seen before; great at encouraging you to think in SETS.
  • AWR / ASH / Wait model evolution – Shame AWR & ASH are licensed separately and there are alternatives for earlier Oracle versions or if you don’t want to fork out the big bucks. But so useful for diagnosing performance problems, particularly retrospectively and/or identifying session problems which might have drowned previously in a system-level Statspack.
  • DBMS_XPLAN – Another feature that I use day in, day out. So easy to get explain plans, actual plans from memory or from AWR. Invaluable!

I would imagine that if you were a production DBA then the evolution of RMAN and Flashback has been fantastically useful and time saving.

These are not things that I’ve tended to use much in my role.

But if something’s revolutionised your decade, let me know.


The way things are

December 18, 2009

You wait for weeks for something to blog about then the basis for three or four articles come along at once.

Last week I was chatting by email to Chet, big bundle of enthusiasm that he is, about stuff, amongst other things, the lack of articles here.

I’ve not fallen off the blog wagon but truth is that for the past six months I’ve not really done any significant development work or performance tuning and that is what normally kicks off something to write about.

Plus I’m trying to avoid the sort of low-value, self-indulgent posts (like this one!) which tend to characterise this blog.

Due the economic downturn, the projects that I work on have gone into more of a maintenance mode so I’ve been somewhat under-utilised and have been doing more minor bug fixes, one line code changes and much, much less big development or tuning.

My heart is in development but small changes in no way float my boat.

It does make me think I should turn towards DBAdom and try and find DBA roles (and maybe it’s less roles but probably more company and departmental cultures) where the DBAs are involved welcomed into and are an integral part of the development process rather than find the sort of Development-focused roles that I’m struggling and have struggled to find recently.

The past couple of weeks I’ve been doing some work for a different team, giving me a change of perspective on the client’s business and data and giving everyone on the previous team a break from my attitude and tone (aggravated by my work frustrations).

No sooner have I moved over when the new team have got a SQL performance problem (not me, I haven’t touched anything).

The system is a model example of reusability and complexity, involving several of these:

  • View -> View -> View -> Pipelined Table function -> View -> View

(i.e. look through the object definitions and you will see that the top level views references several other views which are themselves based on views, in turn based on pipelined-table functions which are based on views of views)
(Reusability and complexity – or rather lack of complexity aka maintainability – can be two points on a trade-off triangle with the other being performance or cost)

The problem is a classic and I’m just working on the spin-off posts which touch (links to be updated shortly):

It’s the Friday before Christmas, I’m snowed in and working from home.

Is that a sled I see before me

I was just doing a half day (finished now of course) because:

  1. I find working remotely incredibly difficult without the two full screens I’ve become accustomed to, I have to remote onto my VM desktop at work where all my files are open via a less than full screen window, and half the open windows are on the second screen which I can’t see remotely;
  2. the kids are here because snow has closed their school;
  3. that sled in the photo is waiting for me.

I was thinking also that I might be the first to make a break for this year’s New Year Resolutions post two weeks early (snow’s made me go scatty)…

In the New Year, I aspire to (in no particular order):

  • Try to be a better person, parent, husband, friend and colleague.
  • Take more holidays.
  • Upgrade my certifications.
  • Find a more satisfying role.
  • Write better technical posts.
  • Write less low-value drivel.

Have a good one.


Loader

November 6, 2009

It’s been years and years since I worked regularly with SQL*Loader-based feeds.

There are loads of tips and tricks regarding formats, encodings, character sets, etc.
I had forgotten nearly everything that I could ever have claimed to know in this area, all aged out of my personal buffer cache, at best distant memories, very distant.

However, I’ve recently had to resurrect some of these distant memories to feed some data via External Tables.

I thought it would be worthwhile to do an incoherent brain dump on some of the particular issues for future reference.

This article is my no means comprehensive and will just touch on some of the considerations specific to what I was doing.

The starting place for documentation on SQL*Loader is the Oracle Database Utilies Manual and similarly for External Tables.

The thing about external tables is that they can use the ORACLE_LOADER or the ORACLE_DATAPUMP access drivers.

My circumstances involved the ORACLE_LOADER driver.

As the names suggest, ORACLE_LOADER is related to SQL*LOADER. In fact, you can use the EXTERNAL_TABLE=GENERATE_ONLY parameter with SQL*LOADER to generate the syntax for your EXTERNAL TABLE.

I think things like this are brilliant but I never use them. I suppose I’m old skool. Neat features that auto generate code mean that a) I don’t necessarily need to understand what’s going on and b) skip the line-by-line sanity checks that I still believe in.

I work in an environment where it’s not so easy to get access to servers to ftp and view files, etc.

As a result, my approach would be to work with SQL*Loader early on and then graduate to external files nearer the time. However, I’m going to skip all that, fast forward past the SQL*Loader bit and move on to the issues.

So the main considerations for loading up data?
- Encoding / Character sets
- Delimiters – field and line
- File Transfer

(I’m going to ignore Endianness for the sake of “clarity”. But it can be a factor. See Byte Ordering for futher information.)

In my experience, when you get errors whilst loading, it’s due to one or more of the reasons above.
More than one?
Sure.

For example, the combination of file transfer mechanism and delimiters.
Ever wondered why there’s confusion over whether you should transfer a file as text (or ascii), as binary? What’s the difference?

As ever, it depends.
- SFTP doesn’t have a text mode, all transfers are binary.
- FTP lets you choose between ascii (default) and binary.
- What about WinSCP and similar tools with their automatic mode?

Automatic modes usually work off the file extension and would pick a text transfer for HTML, TXT, PHP, XML, etc and binary otherwise.

And the differences in mode?

For Text mode, there are two basic methods – either the tool is responsible for doing some conversion to the format supported by the destination or, more commonly, the client converts to a canonical format and the service then does a further conversion to its own format if necessary.

With Binary mode, the raw bytes are transferred as is, i.e. the file is transferred in its original form.

So, the main significance of this related to field and line delimiters – things like tabs and line feed characters are different between platforms, different between Unix and Windows for example.

This was particularly relevant for my file export from SQL Server to Oracle on Linux.

In Windows, a new line is often represented by two characters – one carriage return and one line feed.
In Unix, a new line is normally just a line feed.

Sometimes you see “^M” characters on *nix. What’s this about?
This represents the carriage return part of the newline for Windows as described above.
If you need to, you can get rid of that using dos2unix.

So, in the ACCESS PARAMETERS subsection, if you use the “RECORDS DELIMITED BY NEWLINE” syntax in your external table definiton, what does that mean?

The NEWLINE keyword uses the newline format for your platform – so just a line feed in the case of *nix.

For my process, I decided that I would just go with the format as extracted from SQL Server – a 50:50 decision that there’s no point on expanding on.

So, in my situation, the syntax for an external table should not use the NEWLINE keyword because the format was Windows new lines, the syntax for which can be:

RECORDS DELIMITED BY '\r\n'

Here’s a situation. Things have been going fine in DEV,etc and eventually go to PREPRODUCTION and there’s a problem.

The feeds don’t work. The external table loader isn’t finding the right delimiters.

The question really is how can you tell what characters are in your file?
So, there’s no point looking at the file prior to transfer in case the technological clue (or the person in a manual process) changes the file on transfer. We need to look at it in the destination directory on the destination server.

And what’s the best way to do that?

On a Unix/Linux, a useful command is the od command which dumps files in octal and other formats. For example,

od -c <filename>

which gives ascii characters or backslash escapes.
There’s a lot of output from this command but you don’t need to do the whole file, e.g.

head -2 <filename> | od -x | more

If the file suddenly doesn’t match the expected format, what’s the sort of errors you might get?
Well, if it should have transfered in binary but it was done in text by error, you might get:

KUP-04020: found record longer than buffer size supported

i.e. the lines are running into each other
or under slightly different circumstances

KUP-04023: field start is after end of record

etc.

That’s about it on transfers and delimiters.

What’s there to say about encoding and character sets.
Using the od command above we can see the hex codes to double check the encoding is as expected.
Other that that, in our external table definition we can specify the character set in the ACCESS PARAMETERS section using the CHARACTERSET keyword, e.g.

CREATE TABLE <my_tablename>
(<my_columns>)
ORGANIZATION EXTERNAL
  (  TYPE ORACLE_LOADER
     DEFAULT DIRECTORY <my_directory>
     ACCESS PARAMETERS
       ( RECORDS DELIMITED BY '\r\n'
       CHARACTERSET AL16UTF16
       FIELDS TERMINATED BY '\t'
       MISSING FIELD VALUES ARE NULL  )
     LOCATION (<my_directory>:'<my_filename>')
  )
REJECT LIMIT 0;

and then the expected characterset conversion will take place.
Note that SQLServer tends to use USC2 as its unicode characterset and UCS2 is a subset of AL16UTF16.

It’s Friday, it’s 5 o’clock, I’ve lost focus. I’m done.