Subject: Re: i18n of abiword -- combining characters
From: Paul Rohr (paul@abisource.com)
Date: Sat Jan 15 2000 - 13:57:10 CST


At 03:43 PM 1/14/00 -0800, Leonard Rosenthol wrote:
>At 1:14 PM -0800 1/14/00, Paul Rohr wrote:
>>1. Character sequence normalization. (reasonable)
>>---------------------------------------------------
>>Thus, there needs to be work done (probably at input time) to normalize
>>those sequences of combining characters, and perhaps ignore invalid ones.
>
> If you use the standard OS input methods, they will handle
>all this for you - in fact, they will also handle a number of other
>input issues that are pretty complex for some languages (especially
>CJK).

Of course, we'd love to take advantage of OS-level input methods wherever
possible, but I'm less confident than you are that these will be sufficient
in all cases.

I'm eager to see the code that proves me wrong. :-)

>>(Otherwise, the variant sequences will make features like spell-check
>>prohibitively unreliable.)
>
> And also search & replace. The whole "combined characters"
>in Unicode issue is an interesting one, especially when doing things
>like regular expression searches.

Yep. That's another good reason for normalization.
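To make the point concrete, here is a minimal sketch of why unnormalized input hurts spell-check and search. It assumes Python's standard unicodedata module purely for illustration; nothing here is AbiWord code, and normalize_for_matching is a hypothetical helper name.

```python
import unicodedata

def normalize_for_matching(text: str) -> str:
    """Return the NFC form so canonically equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)

precomposed = "caf\u00e9"   # e-acute stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# A raw comparison fails even though both strings render identically.
raw_equal = precomposed == decomposed

# After normalization the two spellings match.
normalized_equal = (
    normalize_for_matching(precomposed) == normalize_for_matching(decomposed)
)

# The same marks typed in a different order are also "variant sequences".
dot_then_acute = "a\u0323\u0301"   # dot below, then acute
acute_then_dot = "a\u0301\u0323"   # acute, then dot below
order_equal = (
    normalize_for_matching(dot_then_acute) == normalize_for_matching(acute_then_dot)
)
```

A dictionary lookup or a search that compares raw code points would treat each of these pairs as different words, which is the unreliability mentioned above.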

>>2. Combining characters -- position. (???)
>>--------------------------------------------
>>The current code assumes that every Unicode character will occupy one cell
>>of display space of a known width. However, languages like Thai render
>>sequences of several characters into the same display cell.
>
> Since Unicode only has a single code point for any valid
>glyph, your input handler should be converting the multiple
>characters into the new composite glyph value and then you only have
>one character to display.

Some languages may indeed have code points for all the composite glyphs
needed. However, as far as I can tell, this is *not* true for Thai.

  http://charts.unicode.org/Unicode.charts/normal/U0E00.html

Going by that chart, the following combining characters have no precomposed
forms and would need to be composited with one or more other characters at
rendering time:

  0E31
  0E34 - 0E3A
  0E47 - 0E4E

Am I missing something here?
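One way to check this is sketched below, assuming Python's unicodedata module (which reflects a current Unicode database, not the charts as published in 2000). THAI_MARKS, is_nonspacing_mark and composes_under_nfc are hypothetical names used only for this illustration.

```python
import unicodedata

# The code points listed above, expanded from the ranges.
THAI_MARKS = [0x0E31, *range(0x0E34, 0x0E3B), *range(0x0E47, 0x0E4F)]

def is_nonspacing_mark(cp: int) -> bool:
    """True if the code point has general category Mn (nonspacing mark)."""
    return unicodedata.category(chr(cp)) == "Mn"

def composes_under_nfc(base: str, mark: str) -> bool:
    """True if base + mark collapses to a single precomposed code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

# Every listed code point is a nonspacing mark.
all_marks = all(is_nonspacing_mark(cp) for cp in THAI_MARKS)

# None of them composes with a Thai consonant (KO KAI is used as a sample base).
ko_kai = "\u0E01"
any_thai_composite = any(composes_under_nfc(ko_kai, chr(cp)) for cp in THAI_MARKS)

# Contrast: Latin 'e' + COMBINING ACUTE ACCENT does have a precomposed form.
latin_composite = composes_under_nfc("e", "\u0301")
```

If that holds, an input handler has no single code point to convert a Thai sequence into, so the combining has to happen at layout and rendering time.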

>>4. Combining characters -- rendering. (???, platform-specific)
>>----------------------------------------------------------------
>>On each platform, someone will need to investigate whether the
>>text-rendering primitives know how to properly combine a character sequence
>>into a single glyph. If so, drawing should be pretty easy. If not, adding
>>logic to do all that rendering from the constituent glyphs in the font may
>>be difficult.
>>
> Again, if you use the single combined glyph code point, it
>should work just fine when rendered.

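Again, this sounds wonderfully convenient, but I'm not sure it's always
true. It only works if a combined code point exists for every sequence,
which does not seem to be the case for Thai.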
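Whatever the platform primitives turn out to do, the layout code would still need to know which code points share a display cell. A rough sketch of that grouping follows, assuming Python's unicodedata module and a deliberately simplified rule (attach every nonspacing mark to the preceding character). display_cells is a hypothetical name, and real cluster rules are more involved than this.

```python
import unicodedata
from typing import List

def display_cells(text: str) -> List[str]:
    """Group each character with the nonspacing marks (category Mn) after it.

    Simplified assumption: one cell per base character plus its marks.
    """
    cells: List[str] = []
    for ch in text:
        if cells and unicodedata.category(ch) == "Mn":
            # Combining mark: it shares the cell of the preceding character.
            cells[-1] += ch
        else:
            # Base character (or a stray leading mark): start a new cell.
            cells.append(ch)
    return cells

# KO KAI + SARA I + MAI EK, then NGO NGU: four code points, two cells.
sample = "\u0E01\u0E34\u0E48\u0E07"
cells = display_cells(sample)
```

Here the first cell holds three code points and the second holds one, so code that assumes one character per cell would measure this string as four cells wide instead of two.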

Paul


