NLS_LENGTH_SEMANTICS

If you’ve ever messed about with multibyte charactersets then you will probably have come across NLS_LENGTH_SEMANTICS – i.e. the difference between a VARCHAR2(200 BYTE) and a VARCHAR2(200 CHAR) – and, for example, the fact that for a multibyte character string you might get as few as 50 characters into a VARCHAR2(200 BYTE).

Note, in this thread on the OTN Globalization Support Forum, the advice of Oracle’s Sergiusz Wolicki that, contrary to the Oracle documentation (for which a documentation bug has been filed), setting NLS_LENGTH_SEMANTICS to CHAR in the parameter file is in no way recommended.

This means that the recommended approach is to manually specify CHAR in column declarations when you want CHAR semantics. Which makes sense.
And to use an ALTER SESSION command if those four extra characters (five with the space) seem like too much hard work.
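For illustration, a minimal sketch of the two options (the table and column names here are made up):

-- Option 1: be explicit in the column declaration
CREATE TABLE articles (headline VARCHAR2(200 CHAR));

-- Option 2: set the session default before running the DDL
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
CREATE TABLE articles2 (headline VARCHAR2(200));   -- created with CHAR semantics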

Documentation in need of a confidence booster

Some people attach a lot of weight to what the Oracle documentation says.

Rightly so, most of the time. It is a hugely significant resource.

But there are some documentation bugs / errors.

I was answering a character set posting on the OTN forums and when providing some links to the doco, I noticed the quote below in the official documentation:


I think these secions that describe encoding are out of date. For example, do Java and Microsoft Windows still use UCS-2? Or do they use UTF-16? I think UNIX uses UTF-32. So I am confused about what is the most current information for these sections.

How reassuring ;)

Characterset inspection

In my experience, there are at least two subjects that are guaranteed to come around and cause issues again and again – character sets and timezones. They never go away for long.

Maybe it’s because I’m a bit thick, maybe some of the intricacies of the subject matter are counter-intuitive or hard to grasp, or maybe it’s because the issues only come around every now and then, maybe a bit of all three, but I always feel that I’m starting at square one each time.

Anyway, not so long ago I blogged about a few local idiosyncrasies in an application setup regarding charactersets and globalisation.

With so many touchpoints through the typical application techstack and possible confusion about conversions between client and server, the Oracle DUMP function is an excellent way of getting a clear picture of what has been stored in the database.

E.g.

SELECT DUMP(col1,16) FROM mytable;

The DUMP function returns a string containing the datatype, the length in bytes and the internal representation of the values in the column. In this example, the format argument of 16 returns the internal representation in hexadecimal notation.
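By way of a trivial, made-up illustration of that output format (the CAST is there only because a bare literal would dump as Typ=96, CHAR, rather than Typ=1, VARCHAR2):

SELECT DUMP(CAST('Hello' AS VARCHAR2(10)), 16) FROM dual;

Typ=1 Len=5: 48,65,6c,6c,6f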

In a more site-specific example, the application concerned was displaying an article with the Russian headline “Ирану грозят экономические санкции”. By looking at the internal representation in the database, we were able to dispel the myth that this was any sort of proper Cyrillic representation, either in UTF8 or Windows code page 1251:


select headline
, dump(headline,16)
from news
where origid = 6396135;

HEADLINE,DUMP(HEADLINE,16)
Èðàíó ãðîçÿò ýêîíîìè÷åñêèå ñàíêöèè
Typ=1 Len=65: c3,88,c3,b0,c3,a0,c3,ad,c3,b3,20,c3,a3,c3,b0,c3,ae,c3,a7,c3,bf,c3,b2,20,c3,bd,c3,aa,c3,ae,c3,ad,c3,ae,c3,ac,c3,a8,c3,b7,c3,a5,c3,b1,c3,aa,c3,a8,c3,a5,20,c3,b1,c3,a0,c3,ad,c3,aa,c3,b6,c3,a8,c3,a8

So, our 34-character headline is stored as a varchar2 (typ=1) of 65 bytes. The repetition of 'c3' and the disparity between the headline length and the byte storage length suggest that we are storing multi-byte characters.
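As a quick cross-check of those numbers, LENGTH and LENGTHB report the character and byte lengths respectively:

SELECT LENGTH(headline)  AS chars   -- 34
,      LENGTHB(headline) AS bytes   -- 65
FROM news
WHERE origid = 6396135;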
For example, the first two bytes 'c3,88' represent the character È in UTF8.
The first character in the published headline above is 'И'.
The proper UTF8 representation of this is 'd0,98'.
But the Unicode code point of È is 00c8.
And the Windows codepage 1252 hex value for È is c8.
However, 'c8' is the hex code for 'И' in Windows codepage 1251.
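If you have a database handy you can verify those byte values directly. The following is a sketch that assumes a UTF8/AL32UTF8 database character set and 9i or later (UNISTR doesn't exist in 8i):

SELECT DUMP(TO_CHAR(UNISTR('\00C8')), 16) AS utf8_e_grave      -- expect c3,88
,      DUMP(TO_CHAR(UNISTR('\0418')), 16) AS utf8_cyrillic_i   -- expect d0,98
,      DUMP(CONVERT(TO_CHAR(UNISTR('\00C8')), 'WE8MSWIN1252', 'UTF8'), 16) AS cp1252_e_grave  -- expect c8
FROM dual;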

Interesting. So the application is a hack: it is really storing corrupt data which is of no use to any other application, and it subsequently has to do something along the lines of getting this data out as UTF8 (c3,88), translating that to the Unicode code point (00c8), taking the Windows codepage 1252 equivalent (c8) and then translating that byte via Windows codepage 1251 into ‘И’.
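In principle the same unwinding can be sketched inside the database using nested CONVERT calls: treat the stored UTF8 bytes as CP1252 characters, then re-read the resulting single bytes as CP1251 and re-encode them as UTF8. A hedged sketch only – it assumes a UTF8 database character set, and CONVERT output like this is fit for inspection rather than for storing back:

SELECT CONVERT(
         CONVERT(headline, 'WE8MSWIN1252', 'UTF8')   -- UTF8 bytes (c3,88,...) -> single CP1252 bytes (c8,...)
       , 'UTF8', 'CL8MSWIN1251')                     -- re-read those bytes as CP1251 Cyrillic, output as UTF8
FROM news
WHERE origid = 6396135;

Whether the result then displays as readable Cyrillic still depends on the client's NLS_LANG, of course.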

Import & NLS_LENGTH_SEMANTICS

I’ve been importing some data from an export of the production 8i instance to a 10g development instance.

This includes some multibyte data (for a little more on that see previous post).

However, if I just import the data then some of the rows are rejected because the length is too long for the column.

This sounds like an NLS_LENGTH_SEMANTICS thing, whereby the column definitions of the tables concerned should be CHAR rather than BYTE.

I thought that if I changed the NLS_LENGTH_SEMANTICS setting for the database, then the import might create the tables with CHAR column lengths. But it didn’t, which is probably understandable.

So, the only alternative seems to be to create an indexfile from the import, edit it so that the column definitions are CHAR, and then create the tables manually. I need these reverse-engineered creation scripts anyway, I suppose.
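For example (a hypothetical fragment rather than the real indexfile content, which obviously contains the actual production DDL), the edit is essentially this:

-- As generated (BYTE semantics implied):
--   CREATE TABLE news (headline VARCHAR2(200), origid NUMBER);

-- Edited so that the multibyte data will fit:
CREATE TABLE news
( headline VARCHAR2(200 CHAR)
, origid   NUMBER
);

With the tables pre-created like that, the import can then be re-run with IGNORE=Y so that the already-existing tables aren't treated as errors.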

Doing globalisation the hard way?

I’m currently working with a database that supports global content. The majority of content is in English but the DB contains articles published in other Western European languages, in Cyrillic, Chinese Traditional, Chinese Simplified and Arabic.

This is an 8i Oracle database set up with a UTF8 characterset, which is a good start.

Unfortunately, however, from purely a database point of view, that is really the only good news because for all non-Western European languages, the data is stored as gobbledygook.

As is so often the case with a mature application, the Whys and Wherefores have been lost. And there is a tenuous grip on the Hows as well. But on getting the data out of the database, it is encoded by the application using Windows code page 1252. If the data is meant to be in one of the non-Western European languages, then a further encoding is applied – CP1256 for Arabic, GB2312 for Chinese Simplified, Big5 for Chinese Traditional, and Windows 1251 for Cyrillic.

The long and the short of it is that if you look at the foreign-language data in the database and inspect it using the DUMP function, it’s all Western European. From a pure DB perspective, it’s garbage: bits and bobs that only mean something to the application. Perhaps you could even say that, to a certain extent, this is one of those cases where the database is being used as a bit bucket.

So, the only way that you can make sense of a large proportion of the content is through the application.

Is this what is known as logically corrupt data?
