Re: commit: abi: UTF8String class

From: Martin Sevior (msevior@mccubbin.ph.unimelb.edu.au)
Date: Sun Apr 21 2002 - 08:37:58 EDT


    On Sun, 21 Apr 2002, Andrew Dunbar wrote:

    > --- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
    > > > Andrew Dunbar <hippietrail@yahoo.com> wrote:
    > >
    > > > Well pretty soon we're going to need a real
    > > > replacement. Dom and I are both in favour of the
    > > > replacement being UTF-8 but some here seem to want
    > > > UTF-32.
    > >
    > > UTF-8 is an encoding scheme that is intended to
    > > allow Unicode
    > > communication between separate processes over 8-bit
    > > channels.
    > > For that it is great, but that's about the only
    > > thing it is really good
    > > for. UTF-8 processing is cumbersome, and as such it
    > > is a completely
    > > unsuitable format to use for the piecetable. We need
    > > a fixed-width
    > > encoding for that, such as the current UCS-2, i.e.,
    > > UTF-32.
    >
    > Please back up these comments. A lot of people,
    > before
    > they are familiar with Unicode and UTF-8, seem to think
    > this. I did too. Then I read reams and reams of
    > newsgroups and mailing lists and FAQs. Now I know why
    > Qt, GTK, QNX, and others use UTF-8 internally.
    > People seem to think that because UTF-8 encodes
    > characters as variable length runs of bytes that this
    > is somehow computationally expensive to handle. Not
    > so. You can use existing 8-bit string functions on
    > it.
    > It is backwards compatible with ASCII. You can scan
    > forwards and backwards effortlessly. You can always
    > tell which character in a sequence a given byte
    > belongs to.
    > People think random access to these strings using
    > array operator will cost the earth. Guess what - very
    > little code accesses strings as arrays - especially in
    > a Word Processor. Of the code which does, very little
    > of that needs to. Even when you do perform lots of
    > array operations on a UTF-8 string, people have done
    > extensive tests showing that the cost is extremely
    > negligible - look in the Unicode literature and you
    > will find all this information.
    > People think that UCS-2, UTF-16, or UTF-32 mean we can
    > have perfect random access to strings because a
    > character is always represented as a single word or
    > longword. Not so. UCS-2 should, but this term is
    > often (by Microsoft) used to refer to UTF-16. UTF-16
    > uses a mechanism called "surrogates" whereby a single
    > character may need two words to represent it. There
    > goes your free array access. Even UTF-32 is not safe
    > from this, because Unicode requires "combining
    > characters". This means that "á" may be represented
    > as "a" followed by a non-spacing "´" acute accent.
    > Some people think this is also silly. These people
    > need to go read all about Unicode before they embark
    > on seriously multilingual software. Vietnamese is
    > possible to support without combining characters but
    > you won't be able to view the results because no
    > Vietnamese fonts exist that work this way - they all
    > expect to use combining characters. Thai needs them.
    > Hindi needs them. All Indian/Indic languages need
    > them.
    >
    > So to sum up, the two arguments not to use UTF-8
    > internally are:
    >
    > 1) Array access is too slow.
    >
    > - This is not true and it is seldom needed.
    >
    > 2) UTF-8 means you have to handle a series of values
    > for a single on-screen character.
    >
    > - *All* Unicode encodings need this anyway!
    >
    > But look around the internet for better arguments and
    > better written arguments.
    >
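
    To make the quoted point about scanning concrete: in UTF-8 every
    continuation byte has the bit pattern 10xxxxxx, so from any byte you
    can step back to the start of its character. The sketch below is only
    an illustration, not AbiWord code; the names isContinuation, charStart
    and countCodePoints are made up here, and well-formed UTF-8 input is
    assumed.

```cpp
#include <cstddef>
#include <string>

// A UTF-8 continuation byte always looks like 10xxxxxx.
inline bool isContinuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Return the offset of the first byte of the character that contains
// the byte at pos. Assumes well-formed UTF-8 and pos < s.size().
std::size_t charStart(const std::string &s, std::size_t pos)
{
    while (pos > 0 && isContinuation(static_cast<unsigned char>(s[pos])))
        --pos;
    return pos;
}

// Count code points by counting every byte that is not a continuation byte.
std::size_t countCodePoints(const std::string &s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (!isContinuation(static_cast<unsigned char>(s[i])))
            ++n;
    return n;
}
```

    Note that this counts code points, not on-screen characters: a base
    letter followed by a combining accent still counts as two, which is
    the same caveat the quoted message raises for every Unicode encoding.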

    UTF-8 is great for communicating between the piecetable and the widgets. I
    think we should definitely do this. What I don't want is for us to store
    our text as UTF-8 in the piecetable. We have a *LOT* of code that expects
    that every position in the piecetable corresponds to an extra letter of
    text.

    What I think we should do is store our unicode as UT_uint32 in the
    piecetable which can then be randomly accessed the same way we do things
    now.

    We just need to make a global UT_UCSChar => UT_UCSChar32 (==
    UT_uint32) change, plus some hardwired fixes for UT_uint16 about the place
    and some routines to do UT_UCSChar32 => UTF-8 conversion for transporting
    Unicode to the screen.
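
    As a rough idea of how small the 32-bit to UTF-8 conversion routine
    can be, here is a minimal sketch. It is not existing AbiWord code:
    UCSChar32 is a hypothetical stand-in for the proposed UT_UCSChar32,
    and appendUtf8 and toUtf8 are names invented for this example.
    Replacing invalid values with U+FFFD is an assumption, not something
    decided in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the proposed UT_UCSChar32 (== UT_uint32).
typedef std::uint32_t UCSChar32;

// Append the UTF-8 encoding of one code point to out.
// Returns false, appending nothing, for surrogates (U+D800..U+DFFF)
// and for values above U+10FFFF.
bool appendUtf8(UCSChar32 c, std::string &out)
{
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        if (c >= 0xD800 && c <= 0xDFFF)
            return false;
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        return false;
    }
    return true;
}

// Convert a run of n 32-bit characters to a UTF-8 string.
// Invalid values are replaced with U+FFFD (an assumption of this sketch).
std::string toUtf8(const UCSChar32 *text, std::size_t n)
{
    std::string out;
    for (std::size_t i = 0; i < n; ++i)
        if (!appendUtf8(text[i], out))
            appendUtf8(0xFFFD, out);
    return out;
}
```

    With something like this, the piecetable could keep one 32-bit value
    per position and only build UTF-8 when text is handed to the widgets.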

    By the way, glib2.0 (which we need for pango) has a Unicode character type
    which is just a UT_uint32, and plenty of UT_uint32 <=> UTF-8 conversion and
    convenience routines.

    Cheers

    Martin


