Re: commit: abi: UTF8String class

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sun Apr 21 2002 - 10:09:15 EDT


    --- Martin Sevior <msevior@mccubbin.ph.unimelb.edu.au> wrote:
    >
    > On Sun, 21 Apr 2002, Andrew Dunbar wrote:
    >
    > > --- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
    > > >
    > > > > Andrew Dunbar <hippietrail@yahoo.com> wrote:
    > > > >
    > > > > Well pretty soon we're going to need a real
    > > > > replacement. Dom and I are both in favour of the
    > > > > replacement being UTF-8 but some here seem to want
    > > > > UTF-32.
    > > >
    > > > UTF-8 is an encoding scheme that is intended to
    > > > allow Unicode communication between separate
    > > > processes over 8-bit channels. For that it is great,
    > > > but that's about the only thing it is really good
    > > > for. UTF-8 processing is cumbersome, and as such it
    > > > is a completely unsuitable format to use for the
    > > > piecetable. We need a fixed-width encoding for that,
    > > > such as the current UCS-2, i.e. UTF-32.
    > >
    > > Please back up these comments. A lot of people, before
    > > they are familiar with Unicode and UTF-8, seem to think
    > > this. I did too. Then I read reams and reams of
    > > newsgroups and mailing lists and FAQs. Now I know why
    > > Qt, GTK, QNX, and others use UTF-8 internally.
    > > People seem to think that because UTF-8 encodes
    > > characters as variable-length runs of bytes, it is
    > > somehow computationally expensive to handle. Not so.
    > > You can use existing 8-bit string functions on it. It
    > > is backwards compatible with ASCII. You can scan
    > > forwards and backwards effortlessly. You can always
    > > tell which character in a sequence a given byte
    > > belongs to.
    > > People think random access to these strings using the
    > > array operator will cost the earth. Guess what - very
    > > little code accesses strings as arrays, especially in
    > > a word processor. Of the code which does, very little
    > > of that needs to. Even when you do perform lots of
    > > array operations on a UTF-8 string, people have done
    > > extensive tests showing that the cost is negligible -
    > > look in the Unicode literature and you will find all
    > > this information.
    > > People think that UCS-2, UTF-16, or UTF-32 means we
    > > can have perfect random access to strings because a
    > > character is always represented as a single word or
    > > longword. Not so. UCS-2 should, but the term is often
    > > used (by Microsoft) to refer to UTF-16. UTF-16 uses a
    > > mechanism called "surrogates" whereby a single
    > > character may need two words to represent it. There
    > > goes your free array access. Even UTF-32 is not safe
    > > from this, because Unicode requires "combining
    > > characters". This means that "á" may be represented as
    > > "a" followed by a non-spacing "´" acute accent. Some
    > > people think this is also silly. These people need to
    > > go read all about Unicode before they embark on
    > > seriously multilingual software. Vietnamese is possible
    > > to support without combining characters, but you won't
    > > be able to view the results because no Vietnamese fonts
    > > exist that work this way - they all expect to use
    > > combining characters. Thai needs them. Hindi needs
    > > them. All Indian/Indic languages need them.
    > >
    > > So to sum up, the two arguments against using UTF-8
    > > internally are:
    > >
    > > 1) Array access is too slow.
    > >
    > > - This is not true, and it is seldom needed.
    > >
    > > 2) UTF-8 means you have to handle a series of values
    > > for a single on-screen character.
    > >
    > > - *All* Unicode encodings need this anyway!
    > >
    > > But look around the internet for better and better
    > > written arguments.
    > >
    >
    >
    > UTF-8 is great for communicating between the piecetable
    > and the widgets. I think we should definitely do this.
    > What I don't want is for us to store our text as UTF-8
    > in the piecetable. We have a *LOT* of code that expects
    > every position in the piecetable to correspond to one
    > letter of text.
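
    To make the "scan forwards and backwards effortlessly"
    claim above concrete: UTF-8 continuation bytes always have
    the bit pattern 10xxxxxx, so finding a character boundary
    is just a bit test. A rough C++ sketch, not actual AbiWord
    code:

        // A byte is a UTF-8 continuation byte iff its top
        // two bits are "10".
        static inline bool utf8_is_continuation(unsigned char b)
        {
            return (b & 0xC0) == 0x80;
        }

        // Step forward to the start of the next character.
        // Assumes p points at the start of a character in a
        // NUL-terminated, valid UTF-8 string.
        const char *utf8_next(const char *p)
        {
            ++p;
            while (utf8_is_continuation(*p))
                ++p;
            return p;
        }

        // Step backward to the start of the previous
        // character.  The caller must ensure p is not already
        // at the start of the string.
        const char *utf8_prev(const char *p)
        {
            --p;
            while (utf8_is_continuation(*p))
                --p;
            return p;
        }

    No tables, no state - which is why walking a UTF-8 string
    costs about the same as walking an 8-bit one.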

    How is this going to work for languages that need
    combining characters? Isn't it going to need to be
    changed anyway? Isn't now the time to do this
    re-design?

    > What I think we should do is store our unicode as
    > UT_uint32 in the piecetable, which can then be randomly
    > accessed the same way we do things now.

    To randomly access what the user sees as a character, or
    to randomly access what is internally one codepoint?
    These are not the same. But I don't know the piecetable
    either, so maybe it is the right thing to do - as long as
    we are thinking about it.
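
    For instance, "á" can reach us either precomposed or
    decomposed, so even a UT_uint32 array doesn't give one
    slot per visible letter. A crude illustration - the helper
    below only knows about one block of combining marks, and
    real code needs proper Unicode character data:

        #include <cstddef>

        typedef unsigned int UT_uint32;  // assumed 32-bit

        // Very crude: treat only U+0300..U+036F (combining
        // diacritical marks) as combining characters.
        static bool is_combining(UT_uint32 cp)
        {
            return cp >= 0x0300 && cp <= 0x036F;
        }

        // Count what the user sees as characters, rather
        // than codepoints.
        std::size_t visible_length(const UT_uint32 *s,
                                   std::size_t len)
        {
            std::size_t n = 0;
            for (std::size_t i = 0; i < len; ++i)
                if (!is_combining(s[i]))
                    ++n;
            return n;
        }

        // "á" precomposed: one codepoint.
        static const UT_uint32 precomposed[] = { 0x00E1 };
        // "á" decomposed: base letter plus combining acute
        // accent - two codepoints, one letter to the user.
        static const UT_uint32 decomposed[] = { 0x0061, 0x0301 };
        // visible_length(precomposed, 1) == 1
        // visible_length(decomposed, 2)  == 1

    So piecetable positions would index codepoints, and
    anything that means "the character the user sees" still
    has to walk over combining marks.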

    > We just need to make a global UT_UCSChar =>
    > UT_UCSChar32 (== UT_uint32) change, plus some hardwired
    > fixes for UT_uint16 about the place, and some routines
    > to do UT_UCSChar32 => UTF-8 conversion for transporting
    > unicode to the screen.

    We already do this anyway. We can probably improve it,
    but different areas are going to need different
    encodings: the GUI, the filesystem, the document, the
    piecetable, the spellchecker, and the exporter may all
    need something different. This can all be handled by our
    string classes and iconv. Some of those cases are already
    handled nicely now.
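
    Just as a sketch of the iconv side (encoding names and the
    exact iconv() prototype vary a little between platforms,
    so this is an illustration rather than working AbiWord
    code):

        #include <iconv.h>
        #include <cstddef>

        typedef unsigned int UT_uint32;  // assumed 32-bit

        // Convert a UTF-8 buffer into a UCS-4 buffer such as
        // the piecetable might hold.  Error handling is
        // deliberately thin.
        bool utf8_to_ucs4(const char *in, std::size_t inlen,
                          UT_uint32 *out, std::size_t outlen)
        {
            iconv_t cd = iconv_open("UCS-4", "UTF-8");
            if (cd == (iconv_t)-1)
                return false;

            // Some iconv implementations declare the input as
            // char ** rather than const char **.
            char *inp  = const_cast<char *>(in);
            char *outp = reinterpret_cast<char *>(out);
            std::size_t inleft  = inlen;
            std::size_t outleft = outlen * sizeof(UT_uint32);

            std::size_t rc = iconv(cd, &inp, &inleft,
                                   &outp, &outleft);
            iconv_close(cd);
            return rc != (std::size_t)-1;
        }

    Each consumer - GUI, spellchecker, exporter - can open its
    own conversion descriptor and nothing else needs to care
    what the others use.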

    > By the way, glib2.0 (which we need for pango) has a
    > unicode type which is just UT_uint32, and plenty of
    > UT_uint32 <=> UTF-8 conversion and convenience routines.

    Interestingly I just noticed on the GTK I18N mailing
    list archives that Owen says Pango uses UTF-8
    internally but he wishes he'd gone with UTF-32.
    But Pango ought to be doing lower-level manipulations
    than we're going to need, and when we do need that
    stuff we'll probably be using Pango anyway. Maybe we
    should actually ask Owen for advice (:
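
    For what it's worth, the glib 2.0 routines Martin mentions
    look roughly like this in use - I'm going from the docs,
    so treat it as a sketch rather than gospel:

        #include <glib.h>

        // Round-trip between UTF-8 and UCS-4 with the glib
        // 2.0 convenience routines.  gunichar is a 32-bit
        // type, so it maps directly onto the proposed
        // UT_uint32 piecetable storage.
        void demo(const char *utf8_in)
        {
            glong nchars = 0;

            // UTF-8 -> UCS-4, one gunichar per codepoint.
            // (Returns NULL on invalid UTF-8; error handling
            // omitted here.)
            gunichar *ucs4 =
                g_utf8_to_ucs4(utf8_in, -1, NULL, &nchars, NULL);

            // ... store ucs4[0..nchars-1] in the piecetable ...

            // UCS-4 -> UTF-8 again for handing to the widgets
            // and Pango.
            gchar *utf8_out =
                g_ucs4_to_utf8(ucs4, nchars, NULL, NULL, NULL);

            g_free(ucs4);
            g_free(utf8_out);
        }

    So the conversion plumbing for Martin's plan is already
    written for us.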

    Andrew Dunbar.

    > Cheers
    >
    > Martin
    >
    >

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com



