Re: commit: abi: UTF8String class

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sun Apr 21 2002 - 11:01:05 EDT

Next message: Karl Ove Hufthammer: "Re: Ready for the Big Time!"

Previous message: Martin Sevior: "Next Generation Containers."
In reply to: Martin Sevior: "Re: commit: abi: UTF8String class"
Next in thread: Karl Ove Hufthammer: "Re: commit: abi: UTF8String class"
Next in thread: Joaquin Cuenca Abela: "Re: commit: abi: UTF8String class"
Reply: Karl Ove Hufthammer: "Re: commit: abi: UTF8String class"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

--- Martin Sevior <msevior@mccubbin.ph.unimelb.edu.au> wrote:
> > > UTF-8 is great for communicating between the piecetable and the
> > > widgets. I think we should definitely do this. What I don't want
> > > is for us to store our text as UTF-8 in the piecetable. We have
> > > a *LOT* of code that expects that every position in the
> > > piecetable corresponds to an extra letter of text.
> >
> > How is this going to work for languages that need combining
> > characters? Isn't it going to need to be changed anyway? Isn't
> > now the time to do this re-design?
>
> I don't understand this. Doesn't every glyph have a unique Unicode
> code point? If so we still have a one-to-one mapping of glyph to
> text location.

In my previous posts I described combining characters.
Basically instead of "a with acute accent" being one
codepoint, it can be two: "a" + "combining acute
accent". Without them, some languages would need huge
numbers of characters to represent all possible
combinations of base character plus one or more
diacritical marks. They forced Vietnamese to work that
way, with precomposed characters, before they realized
it's a bad idea. Thai and Indian languages need
combining characters.
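
To make the difference concrete, here is a small sketch in Python
(chosen only because its standard library ships the Unicode database;
it is not AbiWord code). It stores the same visible letter once as one
codepoint and once as two, then checks that normalization treats them
as equivalent.

```python
import unicodedata

precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
combining = "a\u0301"    # "a" followed by U+0301 COMBINING ACUTE ACCENT

# Same letter on screen, different number of codepoints in storage.
print(len(precomposed), len(combining))

# The two sequences are canonically equivalent: NFC composes, NFD decomposes.
print(unicodedata.normalize("NFC", combining) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == combining)
```

A plain string comparison of the two forms fails, so any code that
searches or compares text has to settle on one form first.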

> > > What I think we should do is store our Unicode as UT_uint32 in
> > > the piecetable, which can then be randomly accessed the same way
> > > we do things now.
> >
> > To randomly access what the user sees as a character or to
> > randomly access what is internally one codepoint?
>
> OK I don't understand. Are you saying that two code points in a row
> map to a different glyph? If so why not just insert the code point
> for this glyph?

*May* map to a different glyph - but glyph is not the
correct term, I believe. You could have a c with an
acute accent and a cedilla, for instance, which would
need three codepoints but appear on the screen to be
one character. I don't have the proper definition for
glyph handy, sorry.
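
The gap between "position" and "what the user sees" can be sketched
like this. The helper name clusters() is made up for illustration, and
the rule it uses (attach every mark to the preceding base) is a rough
approximation, not the full Unicode text segmentation rules and not
anything the piecetable does today.

```python
import unicodedata

def clusters(text):
    """Group each base codepoint with the combining marks that follow it."""
    groups = []
    for ch in text:
        # General categories Mn, Mc and Me are the Unicode "mark" categories.
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

word = "c\u0301\u0327a"       # c + combining acute + combining cedilla, then "a"
print(len(word))              # codepoints stored
print(len(clusters(word)))    # characters the user sees
```

Storing UT_uint32 values gives random access to the first number. Code
that moves the caret or deletes "one character" wants the second.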

> > These are not the same. But I don't know the piecetable either so
> > maybe it is the right thing to do.
> > As long as we are thinking about it.
>
> Certainly the structure of the code makes lots of assumptions of one
> PT_DocPosition, one glyph. If Unicode was at all sane this should
> not be a problem. Are you telling me that Unicode is not sane and
> that certain glyphs can only be generated if two 32 bit numbers are
> presented consecutively?

Depends on what is sane. Work out how many fully
diacriticized characters would be needed for all the
world's languages and tell me if that's saner than the
Unicode combining character way of doing things...
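
A back-of-the-envelope version of that sum, as a sketch only. The
figures (30 base letters, 10 marks, at most 2 marks per letter) are
invented for one hypothetical script, and the count assumes a letter
carries a set of distinct marks whose order does not matter.

```python
from math import comb

def precomposed_needed(bases, marks, max_marks):
    """Codepoints needed if every base+marks combination gets its own codepoint."""
    per_base = sum(comb(marks, k) for k in range(max_marks + 1))
    return bases * per_base

def combining_needed(bases, marks):
    """Codepoints needed if marks are separate combining characters."""
    return bases + marks

# Made-up numbers for one hypothetical script.
print(precomposed_needed(30, 10, 2))
print(combining_needed(30, 10))
```

With those numbers the precomposed approach needs 1680 codepoints and
the combining approach needs 40, and that is for a single script.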

Andrew Dunbar.

> Cheers
>
> Martin
>
>

=====
http://linguaphile.sourceforge.net http://www.abisource.com



This archive was generated by hypermail 2.1.4 : Sun Apr 21 2002 - 11:02:07 EDT