Oracle Text & Query Templating – Query Rewrite / Relaxation

Traditional use of Oracle Text involves writing a WHERE clause using the CONTAINS operator on the relevant column and SELECTing the score using the SCORE operator, e.g.


SELECT programmename
, score(1)
FROM programme
WHERE CONTAINS( programmename , ’Weakest Link’,1) > 0
ORDER BY score(1) DESC

Query templates are features new to Oracle 10g and are an alternative to the previous query language. Rather than passing a query string, as illustrated above, you pass a structured document which contains the query string in a tagged element. Within this structured document, you can make use of additional query features such as query rewrite and query relaxation.

The query rewrite feature enables you to submit a single query that expands the original query into the rewritten versions, e.g.


SELECT programmename
, score(1)
FROM programme
AND CONTAINS ( programmename , '
<query>
<textquery lang="ENGLISH" grammar="CONTEXT">Weakest Link
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "SYN(", ",LISTING)", "AND"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "!", "%", "AND"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite>/seq>
<seq><rewrite>transform((TOKENS, "${", "}", "AND"))</rewrite></seq>
</progression>
</textquery>
<score datatype="INTEGER" algorithm="DEFAULT"/>
</query>', 1) > 0
ORDER BY score(1) DESC

Query relaxation looks similar to query rewrite and enables the search to execute the most restrictive version of a query first, progressively relaxing the query until the required number of hits are obtained, e.g.


SELECT programmename
, score(1)
FROM programme
AND CONTAINS ( programmename , '
<query>
<textquery lang="ENGLISH" grammar="CONTEXT">Weakest Link
<progression>
<seq>SYN(Weakest Link,LISTING)</seq>
<seq>!{weakest link}%</seq>
<seq>?weakest ?link</seq>
<seq>weakest OR link
<seq>?weakest OR ?link</seq>
<seq>fuzzy({weakest link},60,100,W)</seq>
</progression> </textquery>
<score datatype="INTEGER" algorithm="DEFAULT"/>
</query>', 1) > 0
ORDER BY score(1) DESC

Usage of these features benefit a search application where a variety of searches – exact match, synonym searching, traditional fuzzy (spelling), stem search – may be required to find the best match.

Traditionally, a strategy for handling this would be to issue a number of different individual queries or one query with a number of ORs in the WHERE clause. This approach is inefficient and duplicate results may be returned.
Using query templating to relax a query as such is more efficient than re-executing a query. In addition, by using query templating rather than individual queries, duplicate results are eliminated.
The scores returned are manipulated such that if you order by score, you can be sure that all the rows specified by an earlier criteria will be returned before those specified by a later criteria.
The scoring mechanism used is slightly different. The algorithm as explained above is still applied to each search but the percentages returned are banded according to which search returned the result. In other words, if we have 4 different searches within our query template, the top search would return hits scored between 100% and 76%, the next 75%- 51%, the next 50% – 26% and the last 25% to 1%.

Scoring Algorithm Addendum
It should be noted that the above examples do not demonstrate a good usage of the default scoring algorithm based on the nature of the data indexed. The examples above are taken from an enterprise application where the programme table holds television programme titles, e.g. Eastenders, Weakest Link, etc.

The nature of the default scoring algorithm is such that a search of “Eastenders” against a one word indexed document of “Eastenders” would not return 100%. This is because the algorithm depends on the number of occurences of a searched term against the length of the document. From the docs:


To calculate a relevance score for a returned document in a word query, Oracle Text uses an inverse frequency algorithm based on Salton's formula.

Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.

The following table illustrates Oracle Text's inverse frequency scoring. The first column shows the number of documents in the document set, and the second column shows the number of terms in the document necessary to score 100:

Number of Documents in Document Set Occurences of Term in Document Needed To Score 100
1 34
5 20
10 17
50 13
100 12
500 10
1000 9
10,000 7
100,000 5
1,000,000 4

Oracle Text – 5 million rows, ADDM insight.

Now that the indices are built, Oracle Text query performance is fine with typically sub-second -> second response time observed.

As mentioned previously, at some point, I’m going to have to rebuild these and investigate this crippling index creation but not just yet.

This is a vanilla development 10gR2 install which has given me a short but sweet introduction to ADDM via OEM, as mentioned previously.

OEM homepage > Advisor Central > ADDM

The default on install has given me a week’s worth of AWR snapshots gathered at hourly intervals. Whether this will be enough for most problem resolution depends on the problem and the system. General best practice is to gather statspack snaps at a certain level every 15 minutes. What we’re looking for is for a short enough time period that the problem is significant and identifiable without imposing too large an overhead in terms of storage and performance of doing the snap itself. Jonathan Lewis advocates, in general, a level 0 snap every 15 minutes with a level 5 or greater every hour but with some caveats.

 Below are the reports from one particular one-hour period during the six day journey:

AWR Report

Tuning Task Report

Oracle Text – 5 million rows , the final countdown

Finally, the parallel processes have finished indexing the documents – the last thread finished at 5:27 pm yesterday – but, alas, the index creation has not finished yet.

Here is one of the log files from the index creation process.

Oracle Text – 5 million rows part III

One of the parallel indexing processes has finished, so the job’s nearly done now.

One of the interesting things about an Oracle Text index creation is that the whole thing is not transactional – thank goodness in this case.

For example, with a normal index,  if you make an error in the index creation DDL, the whole thing rolls back.
With an Oracle Text index, if you make an error in the syntax (and maybe it depends in which particular bit of the syntax but I made an error in the MEMORY clause and the PARALLEL clause), you have to drop the index even though the creation errored – so the whole thing is not wrapped in a single transaction.

Anyway, I’m not really any closer to figuring out why my initial index creation is suffering from chronic slowdown as it gets nearer to finishing but now I’m running out of time to look because it’s nearly finished.

But, this being 10g, I thought I’d have a trawl around the extended information in the OWI views, do a statspack snap and also use the EM Gui and the ADDM pages.

I’m not usually one for GUIs like Enterprise Manager and Toad, partly because the more you use them, the more you forget how to do stuff old skool – you forget commands, views, etc. However, in more and more installations that I go to, these tools are the de facto standard, particularly with Toad and with me being more of a developer than a DBA. And against an accomplished GUI button presser, typing a bit of SQL and then correcting the basic typos starts to make you look a big cumbersome. Obviously, the ideal is to know all the buttons and functionality built in to the GUIs whilst retaining knowledge of the views and SQL to do the job behind the scenes.

Anyway, I’m in the process of looking at this stuff now.

I had a look at the EM console using:
emctl start dbconsole
From the home page, I clicked on the Active Sessions icon which took me into the Performance tab, then I clicked on the Run ADDM button which, after an interstitial confirmation, took me into the ADDM screen, clicked on View Snapshots and then selected the Reports tab.

Obviously, there’s a whole heap of interesting information here.
But, from a developer point of view, it’s interesting to note some of the SQL that the Oracle Text index creation seems to be using behind the scenes.

select column_value
from table ( cast ( ctxsys.drvparx.ParallelPopuIndex ( cursor(select /*+ PARALLEL(base 4) FULL(base) */ rowid, NULL, NULL, NULL, 'N' from "MYSCHEMA"."MYTABLE" base), :idxid, :idxown, :idxname, :ixpname, :popstate) as sys.odcivarchar2list))

(Note that I’ve continued to substitute the real table name with ‘mytable’)
(ODCIVARCHAR2LIST is a VARRAY(32767) of VARCHAR2(4000) – didn’t know that.)

This snapshot report takes place over a 30 minute period during which the index creation is still in full swing.
So, given that this query is probably the main driver of the index creation, the elapsed time for this SQL was 99% of the total DB time.

No other SQL takes any significant time (well there’s only 1% left after the one above), but the next highest one is:

SELECT /*+ FULL ("A2") PARALLEL ("A2", 4) */ "A2".ROWID "ROWID", NULL "NULL", NULL "NULL", NULL "NULL", 'N' "'N'" FROM "MYSCHEMA"."MYTABLE" "A2"

and then

select headline, summary, body from "MYSCHEMA"."MYTABLE" where rowid = :rid

Oracle Text – 5 million rows part II

Today is day six … And it’s still running.

At this point, I definitely want it to get to the end. Then I can back it up, export it, whatever, and then I will be in a position to ditch it, try some different settings and see what happens. The last thing that I want to do is to have to wait another six days to get back to where I am now.

So, as I write, one of the four parallel indexing processes has indexed 1122300 documents. The target table has 4572115 rows, so we’re nearly there. 

An important thing to note in all this is that this is a development system with nothing else going on. There are no updates currently happening to the indexed content and no other queries being issued by applications. So, the indexing process has pretty much got the box to itself. This is a Linux box with 2 dual core CPUs and 4Gs of memory.

Before today, I have not had much time to consider what’s going here and investigate while it’s still indexing. But there should be a small window before it finishes.

The first thing to consider is why I jumped in on some of the creation parameters that I did.
As a recap, here’s the initial creation statement:

1 CREATE INDEX i_myindex_ctx01
2 ON mytable(ctx_search_multistore_column)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS ('LEXER my_lexer
5 STORAGE my_storage
6 DATASTORE my_datastore
7 SECTION GROUP my_section
8 FILTER CTXSYS.NULL_FILTER
9 STOPLIST CTXSYS.EMPTY_STOPLIST
10 TRANSACTIONAL
11 SYNC (EVERY "SYSDATE+1/24")
12 MEMORY 250M')
13* PARALLEL 4

I have a BASIC_LEXER set up like this:

CTX_DDL.CREATE_PREFERENCE('my_lexer','BASIC_LEXER');
CTX_DDL.SET_ATTRIBUTE ('my_lexer','INDEX_TEXT','true');
CTX_DDL.SET_ATTRIBUTE ('my_lexer','INDEX_THEMES','false');
CTX_DDL.SET_ATTRIBUTE ('my_lexer','ALTERNATE_SPELLING','none');

BASIC_STORAGE is set up like this

CTX_DDL.CREATE_PREFERENCE('my_storage' ,'BASIC_STORAGE');
CTX_DDL.SET_ATTRIBUTE (''my_storage' ,'I_TABLE_CLAUSE','TABLESPACE ctx_index');
CTX_DDL.SET_ATTRIBUTE ('my_storage' ,'K_TABLE_CLAUSE','TABLESPACE ctx_index');
CTX_DDL.SET_ATTRIBUTE ('my_storage' ,'R_TABLE_CLAUSE','TABLESPACE ctx_index LOB (DATA) STORE AS (CACHE)');
CTX_DDL.SET_ATTRIBUTE ('my_storage' ,'N_TABLE_CLAUSE','TABLESPACE ctx_index');
CTX_DDL.SET_ATTRIBUTE ('my_storage' ,'I_INDEX_CLAUSE','TABLESPACE ctx_index COMPRESS 2');
CTX_DDL.SET_ATTRIBUTE ('my_storage' ,'P_TABLE_CLAUSE','TABLESPACE ctx_index');

A MULTI_COLUM_DATASTORE like this:

CTX_DDL.CREATE_PREFERENCE('my_datastore','MULTI_COLUMN_DATASTORE');
CTX_DDL.SET_ATTRIBUTE ('my_datastore','COLUMNS','headline,summary,body');
CTX_DDL.SET_ATTRIBUTE ('my_datastore','FILTER','y,y,y');

And finally a SECTION_GROUP like this:

CTX_DDL.CREATE_SECTION_GROUP('my_section','BASIC_SECTION_GROUP');
CTX_DDL.ADD_FIELD_SECTION ('my_section','HEADLINE','HEADLINE',TRUE);
CTX_DDL.ADD_FIELD_SECTION ('my_section','SUMMARY','SUMMARY',TRUE);
CTX_DDL.ADD_FIELD_SECTION ('my_section','BODY','BODY',TRUE);

These various options above are going to affect what is indexed and how. How much overhead? No idea. But gut feeling is that it’s going to be some but negligible in the grand scheme of six days.

Before creating the index, I started logging:

-- Log index creation
BEGIN
CTX_OUTPUT.START_LOG('news_index_creation.log');
END;
/

Regarding some of the other parameters in the index creation, the combination of TRANSACTIONAL setting and SYNC setting affect whether updates to the indexed content are seen immediately or a short while later. Default behaviour is for the index update to be asynchronous. The TRANSACTIONAL keyword refers to TRANSACTIONAL QUERY behaviour -inserts and updates on a transactional index get logged in dr$pending, like normal context indexes, but the rowids also get logged in a second table, dr$unindexed. There may be a performance impact to queries on a TRANSACTIONAL index because the dr$unidexed table is scanned. Having set this parameter, then it is best practice to not then SYNC ON COMMIT if it can be avoided. The latter can lead to index fragmentation which leads to slower queries. So, in an editorial type system – which this is – TRANSACTIONAL helps capture the frequent changes in the early lifecycle of an article, and the hourly SYNC reduces the fragmentation but still should be synchronising regularly enough. We’ll see.

I would not expect these two settings to affect index creation time (especially not in a development system where the content is not being updated).

I included two other parameters that would affect indexing speed.
And I had hoped that what I had done would speed things up (as I said above, once finished I can export and then compare speed with default settings, although I can’t compare to initial indexing speed to the speed once it’s indexed a few million docs).

First up – PARALLEL.
My thinking was that with 2 dual core CPUs, I would try a parallelism of 4.
And the MEMORY setting – my understanding of the memory setting is that a large index memory setting can improve the speed of index creation and reduce the final index fragmentation but then that, if set too high, then paging can cripple index speed.

Well, as I get towards the end, I would certainly say that my indexing speed has been crippled. Although, it wasn’t to start with.

Each parallel process indexes a bunch of the documents in memory and then flushes that to the index tables once the memory is full. Before 10g, this was flushed using normal SQL statements. In 10g, the mechanism is direct path inserts.

Looking at one of the four log files (one for each parallel process), I can see that now we’re at one million documents per process, the memory is currently flushed every 70000 documents. The exact number of documents is going to depend on the size of those documents. However, from the start of creation, the first flush was at 150,000 documents, the next at 120,000, then 95,000 and slowly but steadily decreasing.

However, another part of that same logging information shows that, despite the decreasing number of documents, the number of rows written to the I$ tables is relatively constant at about one million rows of which around 30,000 tend to be what it calls “big rows”. In addition, in the same logging, the memory use is confirmed to be around the 250M mark.

So, the in-memory indexing has slowed down dramatically since this started but the number of rows being flushed to I$ by that in-memory process is relatively constant.

So, I definitely need to have a look at the paging activity. So, I need to look at OS utilities vmstat and sar – not an area that I’ve ever looked at much previously. And because this is 10g, I can have a gander at V$OSSTAT as well. The latter might have had some interesting information in it with statnames such as VM_PAGE_OUT_BYTES and VM_PAGE_IN_BYTES but the manual lists these as OS dependent and they’re not in my view.

Oracle Text – 5 million rows, 100 hours and counting

Last week, I was using DBMS_REFINITION to migrate a LONG column on a table to a CLOB so that I could use that column in a MULTI_COLUMN_DATASTORE for an Oracle Text search.

 Unfortunately, on a development system with nothing else happening, it took hours and hours. I left it running with the datafiles set to AUTOEXTEND ON and I think it took the best part of a day.

I knew that I would be off site on Friday and the following Monday so I set off an Oracle Text index creation, thinking that it might take a day or so, at worst.

However, as I write, it’s still going. So, started on the 1st, still going on the 6th. Five million rows. Seems a bit excessive to me.

And as it gets towards the end, although we’re not there yet, the indexing has slowed right down. Initially, each parallel process was indexing 100 documents every 6-8 seconds. Now each 100 documents takes nearly 1 minute so, practically, we’re never going to get to the finish.
CREATE INDEX i_myindex_ctx01
ON mytable(ctx_search_multistore_column)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER mylexer
STORAGE mystorage
DATASTORE mydatastore
SECTION GROUP mysection
FILTER CTXSYS.NULL_FILTER
STOPLIST CTXSYS.EMPTY_STOPLIST
TRANSACTIONAL
SYNC (EVERY "SYSDATE+1/24")
MEMORY 250M')
PARALLEL 4;

There must be some more optimisation that I can do – will need to research, but definitive advice is a bit thin on the ground.

For development purposes, I could make the table NOLOGGING but that probably wouldn’t reflect what I could do in production eventually.

Follow

Get every new post delivered to your Inbox.

Join 70 other followers