Re: commit: abi: UTF8String class

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sun Apr 21 2002 - 11:01:05 EDT

    --- Martin Sevior <msevior@mccubbin.ph.unimelb.edu.au> wrote:
    > > > UTF-8 is great for communicating between the piecetable and
    > > > the widgets. I think we should definitely do this. What I
    > > > don't want is for us to store our text as UTF-8 in the
    > > > piecetable. We have a *LOT* of code that expects that every
    > > > position in the piecetable corresponds to an extra letter of
    > > > text.
    > >
    > > How is this going to work for languages that need combining
    > > characters? Isn't it going to need to be changed anyway? Isn't
    > > now the time to do this re-design?
    >
    > I don't understand this. Doesn't every glyph have a unique
    > Unicode code point? If so we still have a one-to-one mapping of
    > glyph to text location.

    In my previous posts I described combining characters.
    Basically, instead of "a with acute accent" being one
    codepoint, it can be two: "a" + "combining acute
    accent". Some languages would need huge numbers of
    characters to represent all possible combinations of
    a base character plus one or more diacritical marks.
    They forced Vietnamese to work this way before they
    realized it's a bad idea. Thai and Indian languages
    need combining characters.
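
A minimal sketch of the one-codepoint versus two-codepoint point, using Python's standard `unicodedata` module rather than any AbiWord code. The variable names are made up for illustration.

```python
import unicodedata

# "a with acute accent" as a single precomposed codepoint (U+00E1).
precomposed = "\u00e1"

# The same visible character as "a" + COMBINING ACUTE ACCENT (U+0301).
decomposed = "a\u0301"

print(len(precomposed))  # 1
print(len(decomposed))   # 2

# The two spellings are canonically equivalent: normalization
# converts one into the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Both strings display as the same letter, yet they occupy a different number of positions when text is stored one codepoint per slot.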

    > > > What I think we should do is store our Unicode as UT_uint32
    > > > in the piecetable which can then be randomly accessed the
    > > > same way we do things now.
    > >
    > > To randomly access what the user sees as a character or to
    > > randomly access what is internally one codepoint?
    >
    > OK I don't understand. Are you saying that two code points in a
    > row map to a different glyph? If so why not just insert the code
    > point for this glyph?

    *May* map to a different glyph - but glyph is not the
    correct term, I believe. You could have a c with an
    acute accent and a cedilla, for instance, which would
    need three codepoints but appear on the screen to be
    one character. I don't have the proper definition for
    glyph handy, sorry.
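
A rough sketch of the difference between indexing by codepoint and indexing by what the user sees, again in Python and not taken from the piecetable code. The `clusters` helper is hypothetical and only groups a base character with the combining marks that follow it. Full user-perceived character segmentation has more rules than this.

```python
import unicodedata

def clusters(text):
    """Group each base codepoint with the combining marks that follow it.

    Simplified on purpose: a codepoint with a nonzero canonical
    combining class is attached to the preceding group.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

# "c" + COMBINING CEDILLA + COMBINING ACUTE ACCENT, followed by "a".
text = "c\u0327\u0301a"

print(len(text))            # 4 codepoints
print(len(clusters(text)))  # 2 characters as the user sees them

# Cutting at a codepoint offset can split one visible character:
# this slice keeps the cedilla but drops the acute accent.
print(text[:2] == "c\u0327")  # True
```

Random access by codepoint stays cheap with 32-bit storage, but an offset into that storage is not always an offset between visible characters.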

    > > These are not the same. But I don't know the piecetable
    > > either so maybe it is the right thing to do.
    > > As long as we are thinking about it.
    >
    > Certainly the structure of the code makes lots of assumptions of
    > one PT_DocPosition, one glyph. If Unicode was at all sane this
    > should not be a problem. Are you telling me that Unicode is not
    > sane and that certain glyphs can only be generated if two 32 bit
    > numbers are presented consecutively?

    Depends on what is sane. Work out how many fully
    diacriticized characters would be needed for all the
    world's languages and tell me if that's saner than the
    Unicode combining character way of doing things...

    Andrew Dunbar.

    > Cheers
    >
    > Martin
    >
    >

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com



    This archive was generated by hypermail 2.1.4 : Sun Apr 21 2002 - 11:02:07 EDT