Re: i18n of abiword -- combining characters (Thai)


Subject: Re: i18n of abiword -- combining characters (Thai)
From: Pruet Boonma (pruet@eng.cmu.ac.th)
Date: Sun Jan 16 2000 - 03:07:10 CST


Quoting Paul Rohr <paul@abisource.com>:

> At 03:43 PM 1/14/00 -0800, Leonard Rosenthol wrote:
> >At 1:14 PM -0800 1/14/00, Paul Rohr wrote:
> >>1. Character sequence normalization. (reasonable)
> >>---------------------------------------------------
> >>Thus, there needs to be work done (probably at input time) to normalize
> >>those sequences of combining characters, and perhaps ignore invalid ones.
> >
> > If you use the standard OS input methods, they will handle
> >all this for you - in fact, they will also handle a number of other
> >input issues that are pretty complex for some languages (especially
> >CJK).
>
> Of course, we'd love to take advantage of OS-level input methods wherever
> possible, but I'm less confident than you are that these will be sufficient
> in all cases.
>
> I'm eager to see the code that proves me wrong. :-)

I agree: not every OS has full i18n support. On my usual OS, Linux, we have to
add Thai support to many things. We (the tis620-cp group in Thailand) are now
trying to add basic Thai support at the glibc locale level, and we hope this
will eventually make our work easier.

>
> >>(Otherwise, the variant sequences will make features like spell-check
> >>prohibitively unreliable.)
> >
> > And also search & replace. The whole "combined characters"
> >in Unicode issue is an interesting one, especially when doing things
> >like regular expression searches.
>
> Yep. That's another good reason for normalization.

Yes, I think so too. I have added normalization for Thai to the MySQL DBMS, and
it is a simple rule-based normalization. So I don't think normalization in the
input handling should be too hard in AbiWord either.
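
To give an idea of what I mean by rule-based, here is a minimal sketch in
Python. It is not the MySQL code, and the two rules are only assumptions chosen
for illustration: a vowel mark typed after a tone mark is moved in front of it,
and a mark repeated immediately after itself is dropped. The name
normalize_thai and the two sets are hypothetical.

```python
# Illustrative rule set only; a real normalizer would need more rules.
# Above/below vowel marks and phinthu: U+0E31, U+0E34..U+0E3A.
VOWEL_MARKS = {chr(c) for c in [0x0E31, *range(0x0E34, 0x0E3B)]}
# Tone marks mai ek .. mai chattawa: U+0E48..U+0E4B.
TONE_MARKS = {chr(c) for c in range(0x0E48, 0x0E4C)}


def normalize_thai(text: str) -> str:
    """Apply two simple ordering rules to a Thai character sequence."""
    out = []
    for ch in text:
        if ch in VOWEL_MARKS and out and out[-1] in TONE_MARKS:
            # Rule 1: the vowel mark goes before the tone mark.
            tone = out.pop()
            out.append(ch)
            out.append(tone)
        elif ch in (VOWEL_MARKS | TONE_MARKS) and out and out[-1] == ch:
            # Rule 2: skip a mark that directly repeats the previous one.
            continue
        else:
            out.append(ch)
    return "".join(out)


# KO KAI + MAI THO + SARA II is reordered to KO KAI + SARA II + MAI THO.
fixed = normalize_thai("\u0E01\u0E49\u0E35")
```

With sequences stored in one order like this, spell-check and search & replace
can compare strings directly.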

>
> >>2. Combining characters -- position. (???)
> >>--------------------------------------------
> >>The current code assumes that every Unicode character will occupy one cell
> >>of display space of a known width. However, languages like Thai render
> >>sequences of several characters into the same display cell.
> >
> > Since Unicode only has a single code point for any valid
> >glyph, your input handler should be converting the multiple
> >characters into the new composite glyph value and then you only have
> >one character to display.
>
> Some languages may indeed have code points for all the composite glyphs
> needed. However, as far as I can tell, this is *not* true for Thai.
>
> http://charts.unicode.org/Unicode.charts/normal/U0E00.html
>
> As far as I can tell, the following combining characters need to be
> composited with one or more other characters at rendering time:
>
> 0E31
> 0E34 - 0E3A
> 0E47 - 0E4E

If you look at the homepage I mentioned before
(http://www.inet.co.th/cyberclub/trin/thairef/), it has a reference for all the
Thai implementation problems. Like many languages, Thai needs to put several
characters in the same cell, sometimes 3-4 characters in one cell.
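
As a rough sketch of the cell idea (Python, for illustration only), the code
below groups a string into display cells using just the combining ranges Paul
listed above (0E31, 0E34 - 0E3A, 0E47 - 0E4E). The names is_combining and
split_cells are hypothetical, and real rendering would also need to position
the marks, which this does not attempt.

```python
def is_combining(ch: str) -> bool:
    """True for the Thai marks that share a cell with the preceding character."""
    cp = ord(ch)
    return cp == 0x0E31 or 0x0E34 <= cp <= 0x0E3A or 0x0E47 <= cp <= 0x0E4E


def split_cells(text: str) -> list[str]:
    """Group characters so each item holds one base character plus its marks."""
    cells = []
    for ch in text:
        if cells and is_combining(ch):
            # Attach the mark to the cell of the preceding character.
            cells[-1] += ch
        else:
            # A base character (or a mark with nothing before it) opens a cell.
            cells.append(ch)
    return cells


# Six code points (THO THAHAN, SARA II, MAI EK, NO NU, SARA II, MAI EK)
# fall into two cells of three characters each.
cells = split_cells("\u0E17\u0E35\u0E48\u0E19\u0E35\u0E48")
```

So the layout code would measure and draw per cell, not per character.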

Pruet

>
> Am I missing something here?
>
> >>4. Combining characters -- rendering. (???, platform-specific)
> >>----------------------------------------------------------------
> >>On each platform, someone will need to investigate whether the
> >>text-rendering primitives know how to properly combine a character sequence
> >>into a single glyph. If so, drawing should be pretty easy. If not, adding
> >>logic to do all that rendering from the constituent glyphs in the font may
> >>be difficult.
> >>
> > Again, if you use the single combined glyph code point, it
> >should work just fine when rendered.
>
> Again, this sounds wonderfully convenient, but I'm not sure it's always
> true.
>
> Paul
>
