Oracle Text – 5 million rows part II

Today is day six … And it’s still running.

At this point, I definitely want it to get to the end. Then I can back it up, export it, whatever, and then I will be in a position to ditch it, try some different settings and see what happens. The last thing that I want to do is to have to wait another six days to get back to where I am now.

So, as I write, one of the four parallel indexing processes has indexed 1,122,300 documents. The target table has 4,572,115 rows, so we’re nearly there.

An important thing to note in all this is that this is a development system with nothing else going on. There are no updates currently happening to the indexed content and no other queries being issued by applications. So, the indexing process has pretty much got the box to itself. This is a Linux box with two dual-core CPUs and 4GB of memory.

Before today, I have not had much time to consider what’s going on here and investigate while it’s still indexing. But there should be a small window before it finishes.

The first thing to consider is why I jumped in on some of the creation parameters that I did.
As a recap, here’s the initial creation statement:

CREATE INDEX i_myindex_ctx01
ON mytable(ctx_search_multistore_column)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER         my_lexer
             STORAGE       my_storage
             DATASTORE     my_datastore
             SECTION GROUP my_section
             FILTER        CTXSYS.NULL_FILTER
             STOPLIST      CTXSYS.EMPTY_STOPLIST
             TRANSACTIONAL
             SYNC (EVERY "SYSDATE+1/24")
             MEMORY 250M')
PARALLEL 4;

I have a BASIC_LEXER set up like this:

BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_lexer','BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE    ('my_lexer','INDEX_TEXT','true');
  CTX_DDL.SET_ATTRIBUTE    ('my_lexer','INDEX_THEMES','false');
  CTX_DDL.SET_ATTRIBUTE    ('my_lexer','ALTERNATE_SPELLING','none');
END;
/

BASIC_STORAGE is set up like this:

BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_storage','BASIC_STORAGE');
  CTX_DDL.SET_ATTRIBUTE    ('my_storage','I_TABLE_CLAUSE','TABLESPACE ctx_index');
  CTX_DDL.SET_ATTRIBUTE    ('my_storage','K_TABLE_CLAUSE','TABLESPACE ctx_index');
  CTX_DDL.SET_ATTRIBUTE    ('my_storage','R_TABLE_CLAUSE','TABLESPACE ctx_index LOB (DATA) STORE AS (CACHE)');
  CTX_DDL.SET_ATTRIBUTE    ('my_storage','N_TABLE_CLAUSE','TABLESPACE ctx_index');
  CTX_DDL.SET_ATTRIBUTE    ('my_storage','I_INDEX_CLAUSE','TABLESPACE ctx_index COMPRESS 2');
  CTX_DDL.SET_ATTRIBUTE    ('my_storage','P_TABLE_CLAUSE','TABLESPACE ctx_index');
END;
/
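As a side note (nothing to do with speed), once the index exists it’s easy enough to confirm that its internal DR$ tables did land in the ctx_index tablespace as intended. The table names below just follow the standard DR$<index name>$<suffix> pattern for the index created above:

-- Check where the Oracle Text internal tables for this index were created
SELECT table_name, tablespace_name
FROM   user_tables
WHERE  table_name LIKE 'DR$I_MYINDEX_CTX01$%';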

A MULTI_COLUMN_DATASTORE like this:

BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_datastore','MULTI_COLUMN_DATASTORE');
  CTX_DDL.SET_ATTRIBUTE    ('my_datastore','COLUMNS','headline,summary,body');
  CTX_DDL.SET_ATTRIBUTE    ('my_datastore','FILTER','y,y,y');
END;
/

And finally a SECTION_GROUP like this:

BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('my_section','BASIC_SECTION_GROUP');
  CTX_DDL.ADD_FIELD_SECTION   ('my_section','HEADLINE','HEADLINE',TRUE);
  CTX_DDL.ADD_FIELD_SECTION   ('my_section','SUMMARY','SUMMARY',TRUE);
  CTX_DDL.ADD_FIELD_SECTION   ('my_section','BODY','BODY',TRUE);
END;
/
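Just to illustrate what the field sections buy (the search term here is made up; the table and column names are from the index definition above): queries can be restricted to one of the underlying columns via WITHIN, and because the sections are visible (the TRUE argument above), the same text is still found by an unqualified search.

-- Hypothetical query restricted to the headline field section
SELECT COUNT(*)
FROM   mytable
WHERE  CONTAINS(ctx_search_multistore_column, 'budget WITHIN headline', 1) > 0;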

These various options above are going to affect what is indexed and how. How much overhead do they add? No idea. But my gut feeling is that it will be negligible in the grand scheme of six days.

Before creating the index, I started logging:

-- Log index creation
BEGIN
CTX_OUTPUT.START_LOG('news_index_creation.log');
END;
/
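And, for when it does finally finish (or when I kill it to try different settings), the logging can be switched off again:

-- Stop Oracle Text index logging
BEGIN
CTX_OUTPUT.END_LOG;
END;
/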

Regarding some of the other parameters in the index creation, the combination of the TRANSACTIONAL and SYNC settings affects whether updates to the indexed content are seen immediately or a short while later. The default behaviour is for the index update to be asynchronous. The TRANSACTIONAL keyword refers to transactional query behaviour: inserts and updates on a transactional index get logged in dr$pending, as with normal context indexes, but the rowids also get logged in a second table, dr$unindexed. There can be a performance impact on queries against a TRANSACTIONAL index because the dr$unindexed table is scanned. Having set this parameter, it is best practice not to also SYNC ON COMMIT if it can be avoided, because that can lead to index fragmentation, which in turn leads to slower queries. So, in an editorial-type system (which this is) TRANSACTIONAL helps capture the frequent changes in the early lifecycle of an article, and the hourly SYNC reduces the fragmentation while still synchronising regularly enough. We’ll see.
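Nothing is updating the table on this box, but for future reference, a rough way to see how much change is queued for the next SYNC, and how many rowids a transactional query would have to scan in the meantime, is something like this (the second query assumes access to the CTXSYS-owned table):

-- DML queued for the next SYNC on this index
SELECT COUNT(*) AS pending_rows
FROM   ctx_user_pending
WHERE  pnd_index_name = 'I_MYINDEX_CTX01';

-- Rowids logged for transactional queries, not yet synced into the index
SELECT COUNT(*) AS unindexed_rows
FROM   ctxsys.dr$unindexed;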

I would not expect these two settings to affect index creation time (especially not in a development system where the content is not being updated).

I included two other parameters that would affect indexing speed, and I had hoped that what I had done would speed things up. (As I said above, once it’s finished I can export it and then compare against default settings, although I can’t compare the initial indexing speed with the speed once it has indexed a few million documents.)

First up – PARALLEL.
My thinking was that, with two dual-core CPUs, I would try a parallelism of 4.
And the MEMORY setting: my understanding is that a large index memory setting can improve the speed of index creation and reduce the final index fragmentation, but that, if it is set too high, paging can cripple indexing speed.
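One thing worth checking before settling on a figure is what the instance will actually allow; the 250M above has to sit within the system default and maximum:

-- System-wide default and ceiling for Oracle Text index memory
SELECT par_name, par_value
FROM   ctx_parameters
WHERE  par_name IN ('DEFAULT_INDEX_MEMORY', 'MAX_INDEX_MEMORY');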

Well, as I get towards the end, I would certainly say that my indexing speed has been crippled. It wasn’t to start with, though.

Each parallel process indexes a bunch of the documents in memory and then flushes that to the index tables once the memory is full. Before 10g, this was flushed using normal SQL statements. In 10g, the mechanism is direct path inserts.

Looking at one of the four log files (one per parallel process), I can see that, now we’re at one million documents per process, the memory is currently flushed every 70,000 documents. The exact number of documents is going to depend on the size of those documents. However, from the start of creation, the first flush was at 150,000 documents, the next at 120,000, then 95,000, slowly but steadily decreasing.

However, another part of that same logging information shows that, despite the decreasing number of documents, the number of rows written to the I$ tables is relatively constant at about one million rows of which around 30,000 tend to be what it calls “big rows”. In addition, in the same logging, the memory use is confirmed to be around the 250M mark.

So, the in-memory indexing has slowed down dramatically since this started but the number of rows being flushed to I$ by that in-memory process is relatively constant.
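For an independent check on that from the database side, the $I token table for this index can be queried directly. A rough sketch, assuming (and this is an assumption) that the “big rows” in the log correspond to the larger TOKEN_INFO entries:

-- Rough look at what the memory flushes have written to the $I table so far
SELECT COUNT(*)                                   AS token_rows,
       ROUND(AVG(DBMS_LOB.GETLENGTH(token_info))) AS avg_token_info_bytes,
       MAX(DBMS_LOB.GETLENGTH(token_info))        AS max_token_info_bytes
FROM   dr$i_myindex_ctx01$i;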

So, I definitely need to have a look at the paging activity, which means looking at the OS utilities vmstat and sar – not an area that I’ve ever looked at much previously. And because this is 10g, I can have a gander at V$OSSTAT as well. The latter might have had some interesting information in it, with stat names such as VM_PAGE_OUT_BYTES and VM_PAGE_IN_BYTES, but the manual lists these as OS-dependent and they don’t appear on my platform.
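The vmstat/sar side will have to wait until I’ve read up on them, but the V$OSSTAT part is easy enough: just list what this platform actually exposes.

-- Which OS statistics does this platform expose via V$OSSTAT?
SELECT stat_name, value
FROM   v$osstat
ORDER  BY stat_name;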
