Re: i18n of abiword -- representations of strings


Subject: Re: i18n of abiword -- representations of strings
From: Pierre Abbat (phma@oltronics.net)
Date: Mon Jan 17 2000 - 13:37:01 CST


It looks like we will need five representations of a string:
1. For alphabetization.
2. The sequence of letters.
3. Keystrokes.
4. The Unicode representation.
5. The sequence of glyphs.

2->1 is lossy in several languages; for instance, in Hungarian, accents are
ignored in alphabetization, but umlauts are not (ő becomes ö and ű becomes ü).
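
As a rough illustration of why 2->1 loses information, here is a minimal
Python sketch of that folding. The table FOLD and the function
alphabetization_key are made-up names, and this is only the accent folding
described above, not a full Hungarian collation (digraph ordering and
tie-breaking are left out).

```python
import unicodedata

# Hypothetical folding table: acute accents are dropped, while the
# long umlauts (double acute) fold to the plain umlauts.
FOLD = str.maketrans({
    "á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u",
    "ő": "ö", "ű": "ü",
})

def alphabetization_key(letters: str) -> str:
    """Map representation 2 (letters) to representation 1 (sort form)."""
    # Assumption: precomposed (NFC) characters, so each letter is one code point.
    composed = unicodedata.normalize("NFC", letters.lower())
    return composed.translate(FOLD)

# Two different letter sequences can share one key, so the key alone
# cannot be converted back to the letters.
same_key = alphabetization_key("kor") == alphabetization_key("kór")
```

Because several spellings collapse onto one key, the letter sequence has to be
kept alongside the key rather than rebuilt from it.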

4->2, even in European languages, is not always straightforward. Spanish,
Welsh, and Hungarian have digraphs, which are single letters written as two.
Hungarian has the additional complication that a doubled digraph is written as
three characters, as in "asszony" (woman), where "ssz" stands for "sz-sz" and
is split that way when the word breaks at the end of a line. An exception to
that rule is "tizennyolc" (eighteen), a compound that divides as
"tizen-nyolc", not "tizeny-nyolc".

4->5 is straightforward in European languages; in Indian and SE Asian scripts
it is the problem we have been discussing.

We should be able to convert any sequence of letters, even nonsensical ones
like four R's followed by two anusvaras, to Unicode, to glyphs, and back
without losing anything. Keystrokes are another matter; if someone types a
short I in nagari with no following consonant, it can be converted to Unicode
and a glyph, but not to a letter because it would move after the following
consonant, of which there is none.
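
The short I case can be sketched in Python as follows. This assumes a
typewriter-style input method where the short I sign is typed before its
consonant, and it ignores conjuncts (a real version would move the sign past
the whole consonant cluster). The function names are invented for this example.

```python
SHORT_I = "\u093F"  # DEVANAGARI VOWEL SIGN I

def is_consonant(ch):
    # Devanagari consonants KA..HA occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"

def keystrokes_to_letters(keys):
    """Convert typed order (representation 3) to letter order (representation 2).

    A short I typed before a consonant is moved after it. Returns None when
    a short I has no following consonant, since it then has no letter position.
    """
    out = []
    i = 0
    while i < len(keys):
        if keys[i] == SHORT_I:
            if i + 1 < len(keys) and is_consonant(keys[i + 1]):
                out.append(keys[i + 1])
                out.append(SHORT_I)
                i += 2
            else:
                return None  # stranded short I: no consonant to attach to
        else:
            out.append(keys[i])
            i += 1
    return "".join(out)

ki = keystrokes_to_letters("\u093F\u0915")        # short I, then KA
stranded = keystrokes_to_letters("\u0915\u093F")  # KA, then a trailing short I
```

Returning None rather than guessing keeps the stranded keystroke available for
display while making it clear that no letter sequence exists for it yet.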

phma


