RAC – It’s all about the node affinity

Or what happens when you don’t have node affinity.

Here’s an old, short post that was lying around unpublished.

The same SQL, executed on the 14th on node 2 and on the 15th on node 1:

DT      SQL ID         Node  Rows       Execs   Elapsed  Cluster Wait Time
15-NOV  7287jpw5vtm8j  1     4129697    7591    32625    32175
14-NOV  7287jpw5vtm8j  2     19703872   30909   78       60

Timings in seconds.

32625 seconds is just over 9 hours, of which 32175 is cluster wait time.
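
For reference, per-SQL, per-node figures like these can be pulled from AWR (Diagnostics Pack licence permitting); a minimal sketch against DBA_HIST_SQLSTAT, converting the microsecond timings to seconds:

-- Daily elapsed and cluster wait time per instance for one SQL_ID,
-- summed across the AWR snapshots in each day.
select trunc(sn.begin_interval_time)          dt,
       st.sql_id,
       st.instance_number                     node,
       sum(st.rows_processed_delta)           row_count,
       sum(st.executions_delta)               execs,
       round(sum(st.elapsed_time_delta)/1e6)  elapsed_secs,
       round(sum(st.clwait_delta)/1e6)        cluster_wait_secs
from   dba_hist_sqlstat  st,
       dba_hist_snapshot sn
where  sn.snap_id         = st.snap_id
and    sn.dbid            = st.dbid
and    sn.instance_number = st.instance_number
and    st.sql_id          = '7287jpw5vtm8j'
group by trunc(sn.begin_interval_time), st.sql_id, st.instance_number
order by 1, 3;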

Database services and application datasource config have now been reconfigured to the extent that there is node affinity and the data required for the workload on node 1 is completely segregated from the workload on node 2.
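
A quick way to sanity-check that the segregation is actually in effect is to look at where the sessions for each service are landing; a minimal sketch:

-- Distribution of user sessions by instance and service;
-- with node affinity in place each service should map to a single instance.
select inst_id, service_name, count(*) cnt
from   gv$session
where  type = 'USER'
group  by inst_id, service_name
order  by inst_id, service_name;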

If you use RAC as a black box and you have problems with performance, or with the predictability of performance, then look into whether node affinity, or a lack thereof, is your problem.

WFG – mode 5 TM deadlock

Just a brief note to preserve some observations on a deadlock situation reported a while back on the OTN forums.

The symptom of the problem was deadlocks on a RAC system, illustrated by this partial extract of a wait-for-graph from an LMD trace file:

Global Wait-For-Graph(WFG) at ddTS[0.5] :
BLOCKED 0x3d2438198 4 wq 2 cvtops x1 TM 0x1df87.0x0(ext 0x0,0x0)[41000-0001-000004A5] inst 1 
BLOCKER 0x3d7846108 4 wq 2 cvtops x1 TM 0x1df87.0x0(ext 0x0,0x0)[38000-0002-00002C68] inst 2 
BLOCKED 0x3d7846108 4 wq 2 cvtops x1 TM 0x1df87.0x0(ext 0x0,0x0)[38000-0002-00002C68] inst 2 
BLOCKER 0x3d75a4fd8 4 wq 2 cvtops x1 TM 0x1df87.0x0(ext 0x0,0x0)[43000-0003-000009C1] inst 3 
BLOCKED 0x3d75a4fd8 4 wq 2 cvtops x1 TM 0x1df87.0x0(ext 0x0,0x0)[43000-0003-000009C1] inst 3 
BLOCKER 0x3d2438198 4 wq 2 cvtops x1 TM 0x1df87.0x0(ext 0x0,0x0)[41000-0001-000004A5] inst 1 

The details of a WFG are slightly different from those of a non-RAC deadlock graph, not least in how to interpret the lock mode.

Column 8 indicates a TM lock on the object in column 9 – 0x1df87, a hex value that can be converted to decimal and looked up in dba_objects.
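
For example, something along the lines of:

-- 0x1df87 is the object_id in hex; convert it to decimal and look it up.
select owner, object_name, object_type
from   dba_objects
where  object_id = to_number('1DF87', 'XXXXX');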

The lock mode in column 3 is slightly misleading in that “4” does not mean mode 4 but mode 5, as can be seen from this mapping of the KJUSER* lock modes to the familiar enqueue modes:

#define KJUSERNL 0         /* no permissions */   (Null)
#define KJUSERCR 1         /* concurrent read */ (Row-S (SS))
#define KJUSERCW 2         /* concurrent write */ (Row-X (SX))
#define KJUSERPR 3         /* protected read */    (Share)
#define KJUSERPW 4         /* protected write */ (S/Row-X (SSX))
#define KJUSEREX 5         /* exclusive access */ (Exclusive)

"wq 2" is apparently the convert queue, "wq 1" being the grant queue.
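
While the blocking is in progress, a similar picture can usually be had without digging through LMD traces; a rough sketch against the GES views, where GRANT_LEVEL and REQUEST_LEVEL report the KJUSER* modes above and WHICH_QUEUE appears to correspond to the wq value here:

-- Enqueues known to the Global Enqueue Service that are blocking or being blocked.
select inst_id, resource_name1, grant_level, request_level,
       which_queue, owner_node, blocker, blocked
from   gv$ges_blocking_enqueue
order  by resource_name1;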

So, in summary, deadlock issues regarding mode 5 TM locks.

In this particular case, the usual culprit was an unindexed foreign key, albeit a less common variation: a self-referencing foreign key in conjunction with updates to the primary key column – a questionable habit which, to my mind, raises wider questions about the design.

E.g.

Session 1>create table t1
  2  (col1         number       not null
  3  ,col2         varchar2(10) not null
  4  ,related_col1 number
  5  ,constraint pk_t1 primary key (col1)
  6  ,constraint fk_t1 foreign key (related_col1) references t1 (col1));

Table created.

Session 1>insert into t1 values (1,'X',null);

1 row created.

Session 1>insert into t1 values (2,'X',null);

1 row created.

Session 1>commit;

Commit complete.

Session 1>

If we try to update the PK, then session 2 blocks:

Session 1:

Session 1>update t1 set col1 = 1  where col1 = 1;

1 row updated.

Session 1>

Session 2:

Session 2>update t1 set col1 = 2 where col1 = 2;
... hangs...

Blocking on a mode 5 TM:

Session 1>select process,
  2         l.sid,
  3         type,
  4         lmode,
  5         request,
  6         do.object_name
  7  from   v$locked_object lo,
  8         dba_objects do,
  9         v$lock l
 10  where  lo.object_id   = do.object_id
 11  AND    l.sid          = lo.session_id
 12  AND    do.object_name IN ('T1');

PROCESS           SID TY      LMODE    REQUEST OBJECT_NAME
---------- ---------- -- ---------- ---------- -----------
6052:1248        1010 AE          4          0 T1
6052:1248        1010 TM          0          5 T1
496:4000         1640 AE          4          0 T1
496:4000         1640 TM          3          0 T1
496:4000         1640 TX          6          0 T1

Session 1>

It’s not hard to go and engineer a deadlock scenario but as it would be so artificial, I won’t bother.

And as expected, an index addresses the issue.

Session 1:

Session 1>rollback;

Rollback complete.

Session 1>create index i_t1 on t1 (related_col1);

Index created.

Session 1>update t1 set col1 = 1  where col1 = 1;

1 row updated.

Session 1>

Session 2:

Session 2>rollback;

Rollback complete.

Session 2>update t1 set col1 = 2 where col1 = 2;

1 row updated.

Session 2>
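
Coming back to the usual culprit, a rough sketch of a query to flag foreign key columns with no supporting index – kept to single-column keys for simplicity; multi-column keys need the constraint and index columns matched position by position:

-- Single-column foreign keys whose column is not the leading column of any index.
select c.table_name, cc.column_name, c.constraint_name
from   user_constraints  c,
       user_cons_columns cc
where  cc.constraint_name = c.constraint_name
and    c.constraint_type  = 'R'
and    not exists (select null
                   from   user_ind_columns ic
                   where  ic.table_name      = cc.table_name
                   and    ic.column_name     = cc.column_name
                   and    ic.column_position = 1);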

Fighting off RAC

As mentioned, my current client is upgrading from 9i Standard Edition on Windows to 11g Enterprise Edition on Linux.

The timeframe for this upgrade is quite aggressive. While normal development continues at a blistering pace, automated regression testing began part-time in January for an upgrade originally planned for March but now re-arranged for April.

Also in January, I was told that it had been decided that we would be going live on a 2-node RAC cluster. Alarm bells rang, sirens went off, I was bewildered as to how such a decision could be made.

This was a decision that management had apparently made in conjunction with the DBAs, with RAC sold as a black box database clustering solution with no question marks over its suitability.

There’s no architecture department or function and, in terms of Oracle and databases, it’s a role I have previously filled elsewhere. I immediately went into overdrive, briefing against the wisdom of going to RAC before it was too late.

For us, there are numerous issues that I see with RAC:

  • The first is the upgrade timeline. The proposal, in January, was that in March we would be changing hardware, changing OS, changing database version, changing database edition and now also moving to RAC. There are plenty of references out there that suggest RAC alone can take a few months to stabilise and to nail down a proper configuration. For a business that was committed to a March (even April now) deadline, this seemed to me to be a big risk.
  • Secondly, I am far from convinced about the suitability of RAC for our system. It’s one thing for senior management to upgrade the business criticality of a system and give the green light for some spending to upgrade the platform. It’s another thing to upgrade your Ford Focus to a Ferrari only to stay stuck in the same commuting traffic. That’s my analogy of our system – lots of bad SQL, lots of statements that cope with 1001 different inputs, far too many lookups in functions called from SQL. At the end of the day there’s no getting away from the fact that our system averages something like 8 concurrent requests. Historically, it has been maxed out or close to being maxed out on CPU because the SQL is bad.
  • More importantly, I remain very concerned about the suitability of RAC for the profile of our system. I think of RAC as suitable for high-end OLTP systems, or for multiple systems with the non-OLTP systems restricting themselves to one of the nodes (unless non-conflicting activity can somehow be determined and allocated to the relevant nodes). I would probably categorise our single system as a hybrid with a very small element of OLTP and a large element of reporting. Furthermore, a large proportion of that reporting revolves around the top X% of a single table. The RAC architecture is such that repeated requests for the same blocks will cause significant waits as the instances coordinate and pass the blocks back and forth.
  • The bottom line is that RAC is complex and there’s no getting away from the fact that complexity is bad. Complexity eventually comes back round and bites you. And complexity usually costs more.

For now, my briefing has been a partial success, although really the only point that had any effective leverage was that the upgrade date was under threat. A 2-node RAC solution is currently on hold for the initial upgrade, the idea being that we upgrade to 11g on the new hardware and then expand to a 2-node cluster once that initial upgrade phase is stable.

But, and here’s the sticking point, the upgrade is going to be to a 1-node RAC “cluster”. I can understand that there is a once-in-a-blue-moon opportunity to go to the new hardware, but I don’t like the smell of a 1-node RAC cluster at all. For a start, there surely has to be some overhead to the RAC installation, and we’re not even testing that. What’s more, there’s the cost. There is a significant licensing overhead to RAC, an overhead that surely means going to a 2-node cluster is a formality regardless. I’m just glad that it’s not my name on the cheque. It’s one of those things where I just hope that someone down the line doesn’t say that we’re moving off Oracle because it’s so expensive.

A couple of other folks’ thoughts on issues with RAC:
