Block Corruption II

As mentioned before, we’ve had a block corruption that hadn’t yet caused any problems in production but was causing issues for our 11g upgrade rehearsals with the datapump export/import stage.

Because it wasn’t causing any problems in production, the approach to addressing it was somewhat relaxed!

But what this issue did demonstrate was big concerns over the amount of redo that this system generates in a day – 500 GB – and how that volume of redo restricts the practicalities and ability to respond for a physical recovery of the problem, particularly as we only thing we knew was the recovery had happened at some point since 1st January 2012.

Fortunately we were able to identify that:

  1. Only one block in one partition of one table was affected

And we could say with a high degree of confidence that:

  1. That two rows in this one block were affected by the corruption and
  2. We could get the values from these two row from one of the uat databases and that the data hadn’t changed since.

In other words, we could resolve this corruption with a manual, logical approach.

For example, cross-referencing a block dump to see how the block is corrupted, I could get all the rows in this one block that haven’t been affected using something like this:

SELECT DBMS_ROWID.ROWID_ROW_NUMBER(rowid)
,      t.*
FROM   trade PARTITION (p_01) t
WHERE  rowid >= DBMS_ROWID.ROWID_CREATE (1,98381,24,1093595,0)
AND    rowid <  DBMS_ROWID.ROWID_CREATE (1,98381,24,1093596,0)
AND    rowid NOT IN ('AAAYBNAAYAAEK/bAAR','AAAYBNAAYAAEK/bAAT');

So, the approach for this manual fix was along the lines of:
1. Create an empty copy of the affected table.

CREATE TABLE repaired_table
TABLESPACE ....
AS
SELECT *
FROM   TRADE
WHERE  1 = 2;

2. Copy all the data apart from this one block into the new table

INSERT
INTO   repaired_table
SELECT * 
FROM   trade PARTITION (P_01) 
WHERE  rowid < DBMS_ROWID.ROWID_CREATE (1,98381,24,1093595,0)
UNION ALL
SELECT *
FROM   trade PARTITION (P_01)
WHERE  rowid >= DBMS_ROWID.ROWID_CREATE (1,98381,24,1093596,0);

3. Insert rows from block that were ok (double checked via block dump)

INSERT 
INTO   TRADE
SELECT t.*
FROM   trade PARTITION (p_01) t
WHERE  rowid >= DBMS_ROWID.ROWID_CREATE (1,98381,24,1093595,0)
AND    rowid <  DBMS_ROWID.ROWID_CREATE (1,98381,24,1093596,0)
AND    rowid NOT IN ('AAAYBNAAYAAEK/bAAR','AAAYBNAAYAAEK/bAAT');

4. Insert good versions of the corrupted rows extracted from an older UAT database.
5. Create mirror of local indexes on copy table

6. Partition exchange without validation including indexes

7. Rebuild global indexes on original table

8. Do stuff with stats if you needed to (we recovered stats from a stats backup table)

At this point, table REPAIRED_TRADE now has the corrupt block allocated.

Even if we drop the table, DBVERIFY still reports the corrupt block as it has not been formatted.

This is normal as the block will not be formatted until it is taken off the freelist and used by a new segment.

For completeness, we wanted to get this block formatted.

Because we did not drop table REPAIRED_TRADE we know where this block is.

So:

1. TRUNCATE TABLE repaired_trade REUSE STORAGE;

2. Drop the indexes on this table that we wanted for ease of partition exchange

3. Create trigger to tell us when we format the block in question:

CREATE OR REPLACE TRIGGER corrupt_trigger 
AFTER INSERT ON repaired_trade 
REFERENCING OLD AS p_old NEW AS new_p 
FOR EACH ROW 
DECLARE 
  e_corrupt EXCEPTION; 
BEGIN 
  IF  (DBMS_ROWID.ROWID_BLOCK_NUMBER(:new_p.rowid)=1093595)
  AND (DBMS_ROWID.ROWID_RELATIVE_FNO(:new_p.rowid)=24) 
  THEN 
      RAISE e_corrupt; 
  END IF; 
EXCEPTION 
  WHEN e_corrupt THEN 
     RAISE_APPLICATION_ERROR(-20000, 'Corrupt block has been formatted'); 
END; 
/

4. Reinsert data into this table from original TRADE table

INSERT
INTO   repaired_trade
SELECT * FROM trade;

5. Repeat 4 if necessary, i.e. until we get the exception then

6. DROP TABLE repaired_trade;

At this point, DBVERIFY will no longer report that there is a corrupt block.

Block corruption I

Sod’s law – the penultimate dress rehearsal for our 11gR2 upgrade showed that we have suddenly have a block corruption issue on our 9i production database.

Mind you better to find out now than go-live weekend.

It’s only one block and actually it only seems to be two rows in that one block.

The response to the severity 1 SR was less than impressive.

Anyway… as you might expect, the impact of a block corruption can be unpredictable.

Some observations from this particular case:

1. Can’t access the table block by PK.

SQL> explain plan for
  2  select * from trade where trad_tag = 185263584;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
| Id  | Operation                          |  Name       | Rows  | Bytes | Cost (%CPU)| Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |             |     1 |   332 |    13   (8)|       |       |
|   1 |  PARTITION LIST ALL                |             |       |       |            |     1 |    11 |
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| TRADE       |     1 |   332 |    13   (8)|     1 |    11 |
|*  3 |    INDEX RANGE SCAN                | TRADE_IDX   |     1 |       |    24   (5)|     1 |    11 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("TRADE"."TRAD_TAG"=185263584)

14 rows selected.

SQL> select * from trade where trad_tag = 185263584;
select * from trade where trad_tag = 185263584
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel

2. Can access the index with no table access.

SQL> explain plan for
  2  select rowid rd, trad_tag from trade where trad_tag = 185263584;

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------
| Id  | Operation            |  Name       | Rows  | Bytes | Cost (%CPU)| Pstart| Pstop |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |             |     1 |    14 |    12   (0)|       |       |
|   1 |  PARTITION LIST ALL  |             |       |       |            |     1 |    11 |
|*  2 |   INDEX RANGE SCAN   | TRADE_IDX   |     1 |    14 |    24   (5)|     1 |    11 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("TRADE"."TRAD_TAG"=185263584)

13 rows selected.

SQL> select rowid rd, trad_tag from trade where trad_tag = 185263584;

RD                   TRAD_TAG
------------------ ----------
AAAYBNAAYAAEK/bAAR  185263584

3. But can access the corruption directly via rowid:

SQL> explain plan for
  2  select trad_tag, int_override_ind, length(int_override_ind)
  3  from   trade
  4  where  rowid = 'AAAYBNAAYAAEK/bAAR';

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------
| Id  | Operation                  |  Name       | Rows  | Bytes | Cost (%CPU)| Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |             |     1 |    16 |     2  (50)|       |       |
|   1 |  TABLE ACCESS BY USER ROWID| TRADE       |     1 |    16 |     2  (50)| ROWID | ROW L |
-----------------------------------------------------------------------------------------------

7 rows selected.

SQL> select trad_tag, int_override_ind, length(int_override_ind)
  2  from   trade
  3  where  rowid = 'AAAYBNAAYAAEK/bAAR';

  TRAD_TAG I LENGTH(INT_OVERRIDE_IND)
---------- - ------------------------
           N                        8

SQL> 

4. But an index of one of the corrupted values in the corrupt block is also affected:

SQL> explain plan for 
  2  select route_code, count(*) from trade group by route_code;

Explained.

SQL>  select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------
| Id  | Operation              |  Name        | Rows  | Bytes | Cost (%CPU)| Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |              |     3 |     6 | 15098  (57)|       |       |
|   1 |  SORT GROUP BY         |              |     3 |     6 | 15098  (57)|       |       |
|   2 |   PARTITION LIST ALL   |              |       |       |            |     1 |    11 |
|   3 |    INDEX FAST FULL SCAN| TRADE_7_IDX  |    30M|    59M|  7452  (13)|     1 |    11 |
--------------------------------------------------------------------------------------------

9 rows selected.


SQL> select route_code, count(*) from trade group by route_code;
ERROR:
ORA-00600: internal error code, arguments: [%s], [%s], [%s], [%s], [%s], [%s], [%s], [%s]

For one illustration of the corruption, see this VARCHAR2(1) column:

SQL> desc trade
 Name                                                              Null?    Type
 ----------------------------------------------------------------- -------- -----------------
 ...
 INT_OVERRIDE_IND                                                  NOT NULL VARCHAR2(1)
 ...

SQL> select int_override_ind, length(int_override_ind) lngth, dump(int_override_ind) dmp
  2  from trade
  3  where trad_tag = 185263584;

I      LNGTH DMP
- ---------- --------------------------------------------------
N          8 Typ=1 Len=8: 78,1,78,255,255,1,128,1

Back to access by rowid…

5. This works in Toad but not SQL*Plus, i.e. there must be some significant difference in how the
different clients deal with their resultsets:

SQL> select * 
  2  from   trade 
  3  where  rowid IN ( chartorowid('AAAYBNAAYAAEK/bAAR'), 
  5                    chartorowid('AAAYBNAAYAAEK/bAAT'));
ERROR:
ORA-01858: a non-numeric character was found where a numeric was expected


no rows selected

SQL> 

That’s it really.

Obviously we need to fix it.

Follow

Get every new post delivered to your Inbox.

Join 69 other followers