Doing globalisation the hard way?

I’m currently working with a database that supports global content. The majority of content is in English but the DB contains articles published in other Western European languages, in Cyrillic, Chinese Traditional, Chinese Simplified and Arabic.

This is an Oracle 8i database set up with a UTF8 character set, which is a good start.

Unfortunately, purely from a database point of view, that is really the only good news, because for all non-Western European languages the data is stored as gobbledygook.

As is so often the case with a mature application, the Whys and Wherefores have been lost, and there is only a tenuous grip on the Hows. What we do know is that, on getting the data out of the database, the application interprets it using Windows code page 1252. If the data is meant to be in one of the non-Western European languages, then a further conversion is applied: CP1256 for Arabic, GB2312 for Chinese Simplified, Big5 for Chinese Traditional, and Windows 1251 for Cyrillic.
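To make the scheme concrete, here is a minimal sketch of that double encoding in Python (my reconstruction of the steps described above; the Arabic string is just an example):

    # A sketch of the double-encoding scheme described above (reconstruction).
    arabic = "مرحبا"                          # "hello" in Arabic
    cp1256_bytes = arabic.encode("cp1256")    # the language-specific encoding
    mojibake = cp1256_bytes.decode("cp1252")  # the same bytes read as CP1252
    print(mojibake)                           # 'ãÑÍÈÇ' - Western European "text"
    # The application reverses both steps to get the Arabic back:
    assert mojibake.encode("cp1252").decode("cp1256") == arabic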

The long and the short of it is that if you look at the foreign-language data in the database and inspect it using the DUMP function, it's all Western European. From a pure DB perspective, it's garbage: bits and bobs that only mean something to the application. Perhaps you could even say that, to a certain extent, this is one of those cases where the database is being used as a bit bucket.
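Continuing the sketch above, what the UTF8 database actually stores is the UTF-8 encoding of the CP1252 mojibake, not of the original text, and that is exactly what DUMP shows:

    # Continuing the sketch: the bytes the UTF8 database actually stores.
    mojibake = "ãÑÍÈÇ"                        # from the sketch above
    print(mojibake.encode("utf-8").hex(" "))  # c3 a3 c3 91 c3 8d c3 88 c3 87
    # All perfectly valid Western European letters as far as Oracle is
    # concerned, which is why DUMP reports legal but meaningless data.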

So, the only way that you can make sense of a large proportion of the content is through the application.

Is this what is known as logically corrupt data?

3 Responses to Doing globalisation the hard way?

  1. Mghong says:

    As we are running a global application over here, most of our databases are using AL32UTF8.

    But recently there was an issue when a user sitting in Spain created a new user ID in Oracle:

    user id: GFORÇA
    password: GFORÇA

    They are able to log in as either GFORÇA or GFORCA.

    So I started to wonder: Oracle took garbage in, so isn't it supposed to give garbage out as well? Why does it convert Ç to C automatically?

  2. Mghong says:

    I just posted a query on my blog and searched around OraNA to see whether I could find an answer before we help the customer log a TAR.

  3. dombrooks says:

    That’s interesting.

    Off the top of my head I don't know, but I've got a couple of vaguely familiar thoughts.

    I might take a look later but let me know how it goes.

    It raises some interesting questions though, and maybe the answers will make it all seem obvious.

    If you were allowed to store UTF8 characters in the data dictionary, there could be all sorts of problems.
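    For what it's worth, here is one plausible mechanism sketched in Python - purely a guess at the kind of lossy, accent-stripping conversion that could fold Ç into C, not a claim about what Oracle actually does internally:

        # Illustration only: accent-stripping via Unicode decomposition.
        # A guess at the kind of folding that could turn GFORÇA into GFORCA;
        # not a claim about Oracle's actual behaviour.
        import unicodedata

        def strip_accents(s):
            decomposed = unicodedata.normalize("NFD", s)
            return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

        print(strip_accents("GFORÇA"))  # GFORCA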
