Re: Non-latin encoding and languages


Subject: Re: Non-latin encoding and languages
From: Paul Rohr (paul@abisource.com)
Date: Tue Feb 29 2000 - 18:15:23 CST


At 09:29 AM 2/29/00 -0600, sterwill@abisource.com wrote:
>Yes, this does sound like a problem. I have little experience with
>problems like these, but they'll need to be solved to make AbiWord
>usable to those outside of the Latin-1 set of languages.
>
>As usual, there seem to be two problems (and I'll try to keep them
>short, so people more qualified than me can jump in). The first
>is internationalization of the document, so that users can use their
>native character sets to write documents. The encoding issues here
>are specific to the fonts the user is using, and should map into
>Unicode space if we do our job correctly. We haven't done much
>work on this front for non-Latin-1 languages, so I wouldn't be surprised
>if we have no fonts that can handle KOI8-R encodings.
>
>Actually, problem 1.5 is the importers and exporters, which will need
>to do things like character mapping conversions (like you mention).

Nice summary. To put it another way, there are potentially as many as 3
charsets in play here:

  input -- from keyboard, word document, etc.
  document -- we're *always* storing Unicode internally
  output -- fonts for display, and encodings for exporters
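
To make the three stages concrete, here is a rough sketch in Python (not
AbiWord code). The function names and the choice of KOI8-R and CP1251 as
example charsets are assumptions made for illustration only.

```python
# Sketch of the three charsets in play: input bytes -> Unicode -> output bytes.
# Function names and the example charsets are illustrative, not AbiWord APIs.

def import_text(raw: bytes, input_charset: str) -> str:
    """Input side: convert incoming bytes to Unicode once, at the boundary."""
    return raw.decode(input_charset)

def export_text(doc: str, output_charset: str) -> bytes:
    """Output side: convert Unicode to what the font or exporter expects."""
    return doc.encode(output_charset)

# "Привет" as it would arrive from a KOI8-R source.
raw_input = b"\xf0\xd2\xc9\xd7\xc5\xd4"

# Document: always Unicode, whatever the input charset was.
document = import_text(raw_input, "koi8_r")

# Output: the same document can feed a KOI8-R font or a CP1251 exporter.
for_koi8_font = export_text(document, "koi8_r")
for_cp1251_export = export_text(document, "cp1251")
```

The point of the sketch is that only the two boundary functions know about
code pages; everything in between sees Unicode.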

We know that the following special cases work properly:

1. Import/export of native ABW content should just work. We store text
internally as UCS-2, and the importer and exporter convert it from and to
network-safe, ampersand-encoded XML content.

2. Typing and display for Latin-1 languages should also just work. The
naive mappings from keyboard to internals to display all work trivially.
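
The ampersand-encoding in case 1 can be sketched as follows, again in Python
rather than AbiWord's own code. This assumes "ampersand-encoded" means XML
numeric character references; the helper name to_ascii_xml is made up for
the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def to_ascii_xml(text: str) -> str:
    """Escape markup characters, then turn every non-ASCII character into
    a numeric character reference so the result is 7-bit clean."""
    escaped = escape(text)  # handles &, < and >
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

original = "\u041f\u0440\u0438\u0432\u0435\u0442 & co"   # "Привет & co"
encoded = to_ascii_xml(original)

# Any XML parser turns the references back into the same Unicode text.
round_tripped = ET.fromstring("<p>" + encoded + "</p>").text
```

Because the encoded form is pure ASCII, it survives any transport that
handles plain text, which is why this case is safe regardless of locale.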

Henrik has started on the charset conversions needed so that incoming
content gets converted correctly to Unicode. I'm not sure whether we've
done any of the display conversions yet for non-Unicode fonts.

However, there's probably an additional special case which happens to work
right now, even though it shouldn't:

  input -- some funky code page
  document -- not really Unicode, because it never got converted
  output -- display using fonts in that same funky code page

In this case, it might appear to users that everything just works, but the
actual content being manipulated is "secretly" in that mystery code page,
which means that the document is almost guaranteed not to be portable.
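
A small Python sketch of why this case looks fine but isn't. It models the
missing conversion as widening each byte to a code point (equivalent to
treating the input as Latin-1), and models "display with a font in that same
code page" as narrowing back to bytes. Both modelling choices are
assumptions for illustration.

```python
# "Привет" typed through a KOI8-R input path.
raw = b"\xf0\xd2\xc9\xd7\xc5\xd4"

# The accidental case: bytes are widened to code points with no conversion.
unconverted = raw.decode("latin-1")

# What should have been stored: real Unicode Cyrillic.
converted = raw.decode("koi8_r")

# Displayed through a KOI8-R font, the unconverted text yields the original
# bytes again, so the user sees the right glyphs.
bytes_sent_to_font = unconverted.encode("latin-1")
looks_right_locally = (bytes_sent_to_font == raw)

# But the stored code points are Latin-1 letters, not Cyrillic, so any
# Unicode-aware reader elsewhere sees different characters.
stored_code_points = [hex(ord(ch)) for ch in unconverted]
correct_code_points = [hex(ord(ch)) for ch in converted]
```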

This should definitely be fixed.

>The second problem, which I just realized AbiWord has, is that localizations
>need encodings too. All our current menu, toolbar, and string sets
>map into Latin-1 space, but for locales with more than one encoding,
>we may have to figure out a way to represent them all.

Ick and double ick. This may not be a problem for the strings mechanism,
since the XML header mentions the charset used and expat presumably is
converting that content into Unicode for us. So long as the GUI font is
also in the right Unicode range, this may just work.
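
As a rough check of that presumption, here is a sketch using Python's expat
binding. The element and attribute names are hypothetical, not the real
strings file format. ISO-8859-1 and UTF-8 are used because expat handles
them natively; other charsets go through expat's unknown-encoding hook.

```python
import xml.parsers.expat

def parse_strings(xml_bytes: bytes) -> dict:
    """Collect id -> value pairs from a hypothetical <s id=... value=.../>
    strings file. expat reads the charset from the XML header."""
    strings = {}

    def start_element(name, attrs):
        if name == "s":
            strings[attrs["id"]] = attrs["value"]

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_bytes, True)
    return strings

def make_doc(header_charset: str, codec: str) -> bytes:
    """Build the same strings file in a given on-disk charset."""
    text = ('<?xml version="1.0" encoding="%s"?>\n'
            '<strings><s id="MENU_OPEN" value="Datei \u00f6ffnen"/></strings>'
            % header_charset)
    return text.encode(codec)

latin1_doc = make_doc("ISO-8859-1", "latin-1")
utf8_doc = make_doc("UTF-8", "utf-8")

# Different bytes on disk, identical Unicode strings after parsing.
from_latin1 = parse_strings(latin1_doc)
from_utf8 = parse_strings(utf8_doc)
```

If the strings mechanism behaves like this, the charset of the file stops
mattering once parsing is done, and only the GUI font question remains.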

However, the string literals in the hardwired toolbar and menu tables will
only be portable for Latin-1 languages (where no charset conversion is ever
needed). Here again, the issue will be -- does the charset used for storing
the literal match the GUI font being used on the current platform? Chances
are that it doesn't. Blech.

This should definitely be fixed, too, but I'm not sure what the right
solution here will be.

We certainly don't want to start storing the same translation in multiple
charsets (one per platform) -- that'd be a maintenance nightmare.
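
One possible direction, offered only as a sketch and not as a settled
design: keep each translation once, in Unicode, and convert to the
platform's GUI charset when the label is handed to the toolkit. The table
contents, the function name, and the idea that the platform reports a single
charset name are all assumptions.

```python
# One copy of each translation, stored as Unicode. The key and the Russian
# label ("Открыть") are made-up example data.
MENU_LABELS = {
    "FILE_OPEN": "\u041e\u0442\u043a\u0440\u044b\u0442\u044c",
}

def label_for_platform(key: str, platform_charset: str) -> bytes:
    """Convert the single stored translation to the charset the GUI font
    on this platform expects. Unrepresentable characters become '?'."""
    return MENU_LABELS[key].encode(platform_charset, errors="replace")

# Same translation, two platforms with different native charsets.
on_koi8_platform = label_for_platform("FILE_OPEN", "koi8_r")
on_cp1251_platform = label_for_platform("FILE_OPEN", "cp1251")

# A Latin-1-only GUI font cannot show the label at all.
on_latin1_platform = label_for_platform("FILE_OPEN", "latin-1")
```

The last line shows the cost of this approach: conversion can fail for a
given font, so a fallback policy is still needed.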

Sigh.

Paul


