Re: i18n of abiword -- combining characters


Subject: Re: i18n of abiword -- combining characters
From: Paul Rohr (paul@abisource.com)
Date: Fri Jan 14 2000 - 15:14:09 CST


Thai, like some other languages, allows a sequence of individual characters
to be typed to form a single glyph. Usually this takes the form of a base
character which is further modified by other combining characters.

1. Character sequence normalization. (reasonable)
---------------------------------------------------
Thus, there needs to be work done (probably at input time) to normalize
those sequences of combining characters, and perhaps ignore invalid ones.
(Otherwise, the variant sequences will make features like spell-check
prohibitively unreliable.)
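
As a rough illustration of the kind of input-time normalization meant here,
the sketch below uses Unicode canonical normalization (NFC) from Python's
standard unicodedata module. This is a generic technique, not the
Thai-specific algorithm mentioned in this thread, and it is not AbiWord code.
The function name normalize_input and the rule for what counts as an
"invalid" mark (a combining mark with no base character before it) are
assumptions made for the example.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True for characters in the Unicode Mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def normalize_input(text: str) -> str:
    """Put text into NFC and drop combining marks that have no base.

    Assumption for this sketch: a mark is "invalid" if it appears at the
    start of the text or directly after whitespace.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        orphan = is_mark(ch) and (not out or out[-1].isspace())
        if orphan:
            continue  # ignore the invalid mark
        out.append(ch)
    return "".join(out)


# Two typing orders for the same marks on one base character.
variant_a = "\u0e01\u0e48\u0e38"  # KO KAI, MAI EK, SARA U
variant_b = "\u0e01\u0e38\u0e48"  # KO KAI, SARA U, MAI EK
same_after_normalizing = normalize_input(variant_a) == normalize_input(variant_b)
```

Once both variants map to the same stored sequence, a spell-checker only
has to know about one spelling of each word.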

The Thai-specific algorithms here seem to be well-defined, although I'm not
sure whether these get applied before or after step 1.

2. Combining characters -- position. (???)
--------------------------------------------
The current code assumes that every Unicode character will occupy one cell
of display space of a known width. However, languages like Thai render
sequences of several characters into the same display cell. For a WYSIWYG
word processor, changing this fundamental assumption has a variety of
implications.

2a. selection semantics -- When you select one glyph, you select one or
more characters. All edit operations that currently affect one "character"
will need to be reexamined to see whether they should affect the glyph, or
one of the component characters in the glyph. For example, does backspace
delete all characters in the glyph, or does it remove the last combining
character, changing the glyph but maintaining the cursor position?

2b. cursor semantics -- Similarly, moving the cursor one glyph to the right
means moving one or more characters to the right.
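
To make the choices in 2a and 2b concrete, here is a minimal sketch that
treats a base character plus any following combining marks as one cell. It
is not AbiWord code; the function names and the whole_cell flag are
hypothetical, and the "is this a combining mark" test is a simplification
(a real implementation would need fuller segmentation rules for Thai and
other scripts).

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """Simplified test: Unicode Mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def cursor_right(text: str, pos: int) -> int:
    """Move one cell right: skip the base and all marks attached to it."""
    pos += 1
    while pos < len(text) and is_mark(text[pos]):
        pos += 1
    return min(pos, len(text))


def cursor_left(text: str, pos: int) -> int:
    """Move one cell left: back up over marks until a base is reached."""
    pos -= 1
    while pos > 0 and is_mark(text[pos]):
        pos -= 1
    return max(pos, 0)


def backspace(text: str, pos: int, whole_cell: bool):
    """Delete before the cursor; return (new_text, new_pos).

    whole_cell=True  removes every character in the preceding cell.
    whole_cell=False removes only the last character, so the cell stays
    but its glyph changes.
    """
    if pos == 0:
        return text, 0
    start = cursor_left(text, pos) if whole_cell else pos - 1
    return text[:start] + text[pos:], start


# KO KAI + SARA II + MAI EK form one cell, followed by a second KO KAI.
sample = "\u0e01\u0e35\u0e48\u0e01"
```

With this sample, cursor_right(sample, 0) lands on index 3, so one cursor
step covers three characters. The two backspace policies differ only in how
far back the deletion starts.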

3. Combining characters -- width. (???)
-----------------------------------------
More fundamentally, our charwidth-handling logic will need to be expanded to
handle combining characters. In most cases, the width of the first
character in the sequence determines the width of the entire cell. However,
some combinations make the entire cell wider.

Currently, the formatter maintains a per-character array of widths. For
combining characters, we can't just add those widths to the total width of
the word. Instead, they'll somehow need to be folded into the width
calculation for the resulting glyph. The exact algorithm needed here depends
on how information about the "width" of combining characters is stored in
the font.

I'm totally just guessing now, but two simple approaches come to mind:

3a. Sum. If we assume that most combining characters don't affect the
glyph width, we could store their charwidth as zero and then calculate the
glyph width as the sum of the charwidths of all its characters. (Presumably,
then, if some combining characters do affect the overall glyph width, we'd
have to store the difference for them.)

3b. Max. Alternatively, if the cell width of each combining character
indicates the width of the resulting cell when this combining character is
used, then we might instead be able to just calculate the glyph width as the
maximum of its constituent charwidths.
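
The two guesses above can be compared with a small sketch. Everything in it
is hypothetical: the width tables, the numbers, and the function word_width
are made up for illustration and say nothing about how any real font stores
widths for combining characters.

```python
from typing import Callable, Dict, Iterable, List

BASE = "\u0e01"       # a base character
MARK = "\u0e48"       # a mark assumed not to change the cell width
WIDE_MARK = "\u0e35"  # a mark assumed (for this example) to widen the cell

# 3a: marks store zero, or the extra width they add to the cell.
SUM_TABLE: Dict[str, int] = {BASE: 10, MARK: 0, WIDE_MARK: 2}

# 3b: marks store the width of the resulting cell when they are used.
MAX_TABLE: Dict[str, int] = {BASE: 10, MARK: 0, WIDE_MARK: 12}


def word_width(
    cells: List[str],
    table: Dict[str, int],
    combine: Callable[[Iterable[int]], int],
) -> int:
    """Width of a word given as a list of cells (base plus its marks).

    combine is applied to the charwidths inside each cell: pass the
    built-in sum for approach 3a, or the built-in max for approach 3b.
    The cell widths are then added up across the word.
    """
    return sum(combine(table[ch] for ch in cell) for cell in cells)


word = [BASE + WIDE_MARK + MARK, BASE]  # two cells, four characters

width_3a = word_width(word, SUM_TABLE, sum)
width_3b = word_width(word, MAX_TABLE, max)

# What a plain per-character sum would give with 3b-style widths.
naive_3b = sum(MAX_TABLE[ch] for cell in word for ch in cell)
```

Both approaches give 22 units for this word, while simply adding up the
per-character array under the 3b storage scheme gives 32, which is the
over-counting problem described above.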

However, I really don't know what I'm talking about here. More
investigation is definitely needed, depending on the capabilities of each
platform's GR_Graphics::measureString() implementation.

4. Combining characters -- rendering. (???, platform-specific)
----------------------------------------------------------------
On each platform, someone will need to investigate whether the
text-rendering primitives know how to properly combine a character sequence
into a single glyph. If so, drawing should be pretty easy. If not, adding
logic to do all that rendering from the constituent glyphs in the font may
be difficult.

Paul


