Online Statistics Gathering

Online Statistics Gathering (https://sqlmaria.com/2017/01/03/online-statistics-gathering/) is a pretty cool feature, available since the now ages-old Oracle 12.1, and especially useful for data warehouses.

As a recap, when you do a direct path load into an empty segment, Oracle takes the opportunity to gather statistics on the data whilst it is loading it, i.e. there is no need to go back and re-read the data in an explicit stats gathering call afterwards.
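For illustration, a minimal sketch of two loads that would qualify, where t_new, t_empty and source_t are just placeholder names:

-- a CTAS is a direct path load into a (necessarily) empty segment
create table t_new as
select * from source_t;

-- as is a direct path (append) insert into an empty table
insert /*+ append */ into t_empty
select * from source_t;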

Note that, per Maria’s note, only base table and column statistics are gathered, not histograms, but those can be gathered afterwards, without re-gathering the base column statistics, via GATHER AUTO.
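For reference, a minimal sketch of that follow-up call, assuming the OPTIONS parameter of DBMS_STATS.GATHER_TABLE_STATS, which gathers only what is missing and leaves the online-gathered base statistics in place:

exec dbms_stats.gather_table_stats(user, 'T1', options => 'GATHER AUTO');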

Old news.

We have only recently upgraded a complex application from 11.2.0.4 and hit an issue which is really obvious in hindsight but worth pointing out, particularly in combination with another feature.

I’m sure the behaviour below will come as no surprise in 11.2.0.4:

drop table t1;
create global temporary table t1
on commit preserve rows
as
select * from user_objects where rownum <= 100;

explain plan for
select * from t1;

select * from table(dbms_xplan.display);
select num_rows from user_tables where table_name = 'T1';
select column_name, num_nulls, num_distinct from user_tab_columns where table_name = 'T1';

The output of note:

Plan hash value: 3617692013
----------------------------------------------------------------------------------
| Id | Operation                      | Name | Rows | Bytes | Cost (%CPU)| Time  |
----------------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |      |  100 | 19000 |   2  (0)| 00:00:01 |
|  1 | TABLE ACCESS STORAGE FULL      | T1   |  100 | 19000 |   2  (0)| 00:00:01 |
----------------------------------------------------------------------------------

Note
-----
  - dynamic sampling used for this statement (level=2)


 NUM_ROWS
----------
      

COLUMN_NAME           NUM_NULLS NUM_DISTINCT
-------------------- ---------- ------------
GENERATED                       
SECONDARY                       
NAMESPACE                       
EDITION_NAME                      
OBJECT_NAME                      
SUBOBJECT_NAME                     
OBJECT_ID                       
DATA_OBJECT_ID                     
OBJECT_TYPE                      
CREATED                        
LAST_DDL_TIME                     
TIMESTAMP                       
STATUS                         
TEMPORARY

We load 100 rows, there are no stats, dynamic sampling kicks in and, bingo, the optimizer expects 100 rows.

Then for versions > 11.2.0.4 (in this case 19.6), with the addition of the SCOPE column to the stats queries for 19c, we see that we now have table and column statistics after the load despite no explicit stats gathering call. Which, given the feature we’re discussing, should be completely obvious and expected.

I’ve used a temporary table here deliberately, purely to facilitate a comment at the end. Clearly this could be a normal heap table, but the SCOPE in the output is specific to the newer feature of temporary table session-private statistics.
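For reference, the amended stats queries, on the assumption that the SCOPE column is read from the *_TAB_STATISTICS views rather than from USER_TABLES or USER_TAB_COLUMNS:

select num_rows, scope
from   user_tab_statistics
where  table_name = 'T1';

select column_name, num_nulls, num_distinct, scope
from   user_tab_col_statistics
where  table_name = 'T1';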

Plan hash value: 3617692013
----------------------------------------------------------------------------------
| Id | Operation                   | Name | Rows | Bytes | Cost (%CPU)|     Time |
----------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |      |  100 | 11300 |      2  (0)| 00:00:01 |
|  1 | TABLE ACCESS STORAGE FULL   | T1   |  100 | 11300 |      2  (0)| 00:00:01 |
----------------------------------------------------------------------------------

Note
-----
  - Global temporary table session private statistics used


  NUM_ROWS SCOPE  
---------- -------
           SHARED 
       100 SESSION

COLUMN_NAME            NUM_NULLS NUM_DISTINCT SCOPE  
--------------------- ---------- ------------ -------
OBJECT_NAME                    0            7 SESSION
SUBOBJECT_NAME                 7           93 SESSION
OBJECT_ID                      0          100 SESSION
DATA_OBJECT_ID                 3           97 SESSION
OBJECT_TYPE                    0            3 SESSION
CREATED                        0           95 SESSION
LAST_DDL_TIME                  0           97 SESSION
TIMESTAMP                      0           96 SESSION
STATUS                         0            1 SESSION
TEMPORARY                      0            1 SESSION
GENERATED                      0            2 SESSION
SECONDARY                      0            1 SESSION
NAMESPACE                      0            1 SESSION
EDITION_NAME                 100            0 SESSION
SHARING                        0            1 SESSION
EDITIONABLE                  100            0 SESSION
ORACLE_MAINTAINED              0            1 SESSION
APPLICATION                    0            1 SESSION
DEFAULT_COLLATION             94            1 SESSION
DUPLICATED                     0            1 SESSION
SHARDED                        0            1 SESSION
CREATED_APPID                100            0 SESSION
CREATED_VSNID                100            0 SESSION
MODIFIED_APPID               100            0 SESSION
MODIFIED_VSNID               100            0 SESSION

Again, we’ve loaded 100 rows and, as part of the direct path load, we have base table & column statistics, so the optimizer expects 100 rows. No dynamic sampling necessary.

But what if our data is loaded in multiple steps? Might this new online statistics gathering cause us a bit of a problem?

It only gathers the stats when the segment is empty.

I’m not going to repeat the output above for the table & column statistics, just the execution plan detail.

drop table t1;
create global temporary table t1
on commit preserve rows
as
select * from user_objects where rownum <= 100;

insert /*+ append */ into t1
select * from user_objects;
commit;

insert /*+ append */ into t1
select * from user_objects;
commit;

explain plan for
select * from t1;

Which in 11.2.0.4 gives:

Plan hash value: 3617692013
---------------------------------------------------------------------------------------
| Id | Operation                      | Name |   Rows | Bytes | Cost (%CPU)|    Time  |
---------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |      |  32376 | 6007K |    93   (0)| 00:00:01 |
|  1 | TABLE ACCESS STORAGE FULL      | T1   |  32376 | 6007K |    93   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Note
-----
  - dynamic sampling used for this statement (level=2)

select count(*) from t1;

  COUNT(*)
----------
     34392

And in 19.6

Plan hash value: 3617692013
----------------------------------------------------------------------------------
| Id | Operation                   | Name | Rows | Bytes | Cost (%CPU)|     Time |
----------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |      |  100 | 11300 |      2  (0)| 00:00:01 |
|  1 | TABLE ACCESS STORAGE FULL   | T1   |  100 | 11300 |      2  (0)| 00:00:01 |
----------------------------------------------------------------------------------

Note
-----
  - Global temporary table session private statistics used

select count(*) from t1;

  COUNT(*)
----------
     59712*

*With an obvious difference in actual rows, because these are different environments on different versions.

The problem now is that our subsequent processing code, which drives off a table populated with different volumes of data in multiple stages, used to work fine in 11.2.0.4 but is now at risk of some serious consequences from cardinality mis-estimates, thanks to inaccurate statistics from our new feature, which is behaving exactly as it should.

In our real-world example, it was not always the case that there was a big skew between the first load and the subsequent loads, and this consequence, obvious in hindsight, took a few weeks to reveal itself, until that first load into an empty segment happened to be very small and was followed by other loads which were much larger.

Furthermore, this also involved the temporary table session-private statistics aspect, so when you’re looking at solutions you might consider whether some explicit stats gathering at the end of the load is the answer (and, if so, whether it should be private stats or shared stats, or whether that even matters), when quite possibly the easiest solution is to just use dynamic sampling in the subsequent select(s).
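A few sketches of those options, hedged in that they use standard DBMS_STATS calls and hints but are untested against this exact scenario (the GLOBAL_TEMP_TABLE_STATS preference controls whether a gather on a GTT produces SESSION or SHARED stats, and the NO_GATHER_OPTIMIZER_STATISTICS hint disables the online gathering on the load itself):

-- Option 1: explicit stats gathering once all the loads are done;
-- the preference decides whether we get session-private or shared stats
exec dbms_stats.set_table_prefs(user, 'T1', 'GLOBAL_TEMP_TABLE_STATS', 'SESSION');
exec dbms_stats.gather_table_stats(user, 'T1');

-- Option 2: suppress online statistics gathering on the first load,
-- so that dynamic sampling kicks in just as it did in 11.2.0.4
insert /*+ append no_gather_optimizer_statistics */ into t1
select * from user_objects;

-- Option 3: delete the misleading stats and/or just force dynamic
-- sampling in the subsequent selects
exec dbms_stats.delete_table_stats(user, 'T1');
select /*+ dynamic_sampling(t1 2) */ * from t1;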

I was reminded of this issue when writing another article earlier on a more serious problem that online statistics gathering was causing.
