characterset inspection

In my experience, there are at least two subject matters that are guaranteed to come around to cause issues again and again – character sets and timezones – they never go away for long.

Maybe it’s because I’m a bit thick, maybe some of the intracacies of the subject matter are counter-intuitive or hard to grasp, or maybe it’s because the issues only come around every now and then, maybe a bit of all three, but I always feel that I’m starting at square 1 each time.

Anyway, not so long ago I blogged about a few local idiosyncracies in an application set up regarding charactersets and globalisation.

With so many touchpoints through the typical application techstack and possible confusion about conversions between client and server, the Oracle DUMP function is an excellent way of getting a clear picture of what has been stored in the database.

E.g.

SELECT DUMP(col1,16) FROM mytable;

The DUMP function returns a string containing the datatype, the length in bytes and the internal representation of the values in the column. In this example, the format mask of ’16’ returns the internal representation in hexadecimal notation.

In a more site specific example, the application concerned was displaying an article with the russian headline “Ирану грозят экономические санкции”. By looking at the internal representation in the database, we were able to dispell the myth that this was any sort of proper Cyrillic representation, either in UTF8 or Windows code page 1251:


select headline
, dump(headline,16)
from news
where origid = 6396135;

HEADLINE,DUMP(HEADLINE,16)
Èðàíó ãðîçÿò ýêîíîìè÷åñêèå ñàíêöèè
Typ=1 Len=65: c3,88,c3,b0,c3,a0,c3,ad,c3,b3,20,c3,a3,c3,b0,c3,ae,c3,a7,c3,bf,c3,b2,20,c3,bd,c3,aa,c3,ae,c3,ad,c3,ae,c3,ac,c3,a8,c3,b7,c3,a5,c3,b1,c3,aa,c3,a8,c3,a5,20,c3,b1,c3,a0,c3,ad,c3,aa,c3,b6,c3,a8,c3,a8

So, our 34 character headline is stored as a varchar2 (typ=1) of 65 bytes. The repetition of ‘c3′ and the disparity between headline length and byte storage length suggests that we are storing multi-byte characters.
For example, the first two bytes ‘c3, 88′ represent the character È in UTF8.
The first character in the published headline above ‘И’.
The proper utf8 representation of this is ‘d0, 98′
But the unicode representation of È is 00c8.
And the Windows codepage 1252 hex value is c8.
However, ‘c8′ is the hex code for ‘И’ in Windows codepage 1251.

Interesting. So, the application is a hack, is really storing corrupt data which is of no use to any other aplication and subsequently has to do something along the lines of getting this data out in UTF8 (c3,88) , translating to Unicode (00c8) , translating to Windows codepage 1252 equivalent (c8) and then again via Windows codepage 1251 transalation (И).

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 75 other followers

%d bloggers like this: