Re: commit: abi: UTF8String class

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sat Apr 20 2002 - 23:35:52 EDT

     --- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
    > > Andrew Dunbar <hippietrail@yahoo.com> wrote:
    >
    > > Well pretty soon we're going to need a real
    > > replacement. Dom and I are both in favour of the
    > > replacement being UTF-8 but some here seem to want
    > > UTF-32.
    >
    > UTF-8 is an encoding scheme that is intended to allow
    > Unicode communication between separate processes over
    > 8-bit channels. For that it is great, but that's about
    > the only thing it is really good for. UTF-8 processing
    > is cumbersome, and as such it is completely unsuitable
    > format to use for the piecetable. We need a fixed width
    > encoding for that, such as the current UCS-2, i.e.,
    > UTF-32.

    Please back up these comments. A lot of people,
    before they are familiar with Unicode and UTF-8, seem
    to think
    this. I did too. Then I read reams and reams of
    newsgroups, mailing lists, and FAQs. Now I know why
    GTK, QNX, and others use UTF-8 internally.
    People seem to think that because UTF-8 encodes
    characters as variable-length runs of bytes, it is
    somehow computationally expensive to handle. Not so.
    You can use existing 8-bit string functions such as
    copy, compare, and search on it.
    It is backwards compatible with ASCII. You can scan
    forwards and backwards easily, because the byte that
    starts a character never looks like a byte that
    continues one. You can always tell which character in
    a sequence a given byte belongs to.
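
    To make the scanning claim concrete, here is a minimal
    sketch in Python (not AbiWord code). The helper names
    are made up for illustration, and it assumes the buffer
    holds valid UTF-8. It relies only on the fact that
    continuation bytes always have the bit pattern 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which always look like 0b10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(buf: bytes, i: int) -> int:
    """Index of the first byte of the character that byte i belongs to."""
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


def next_char(buf: bytes, i: int) -> int:
    """Index of the first byte of the character following the one at i."""
    i += 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i


def prev_char(buf: bytes, i: int) -> int:
    """Index of the first byte of the character before the one starting at i.

    Assumes i > 0 and that i is itself a character start.
    """
    i -= 1
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


# "naive cafe" with i-diaeresis (U+00EF) and e-acute (U+00E9), 2 bytes each.
text = "na\u00efve caf\u00e9".encode("utf-8")

# A plain byte-oriented search works unchanged on UTF-8 data.
cafe_at = text.find("caf\u00e9".encode("utf-8"))

# Byte 3 is the second byte of the i-diaeresis; resynchronise to its start.
start_of_i = char_start(text, 3)
```

    Here text.find() is standing in for any 8-bit search
    routine such as strstr(). It cannot match in the middle
    of a character, because a lead byte never looks like a
    continuation byte.
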
    People think random access to these strings using the
    array operator will cost the earth. Guess what - very
    little code accesses strings as arrays, especially in
    a word processor. Of the code which does, very little
    needs to. Even when you do perform lots of array
    operations on a UTF-8 string, people have done
    extensive tests showing that the cost is negligible -
    look in the Unicode literature and you will find all
    this information.
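
    As a rough illustration of what array-style access on
    UTF-8 involves, here is a small Python sketch. The
    function names are hypothetical and it assumes valid
    UTF-8. It measures nothing. It only shows that
    sequential access is a single forward pass and that
    finding the n-th character is a linear scan.

```python
def iter_chars(buf: bytes):
    """Yield (byte_offset, character) pairs in one forward pass."""
    i = 0
    while i < len(buf):
        j = i + 1
        # Swallow the continuation bytes (0b10xxxxxx) of this character.
        while j < len(buf) and (buf[j] & 0xC0) == 0x80:
            j += 1
        yield i, buf[i:j].decode("utf-8")
        i = j


def byte_offset_of(buf: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based), found by linear scan."""
    for k, (offset, _) in enumerate(iter_chars(buf)):
        if k == n:
            return offset
    raise IndexError(n)


# a, n-tilde (2 bytes), b, euro sign (3 bytes), c
sample = "a\u00f1b\u20acc".encode("utf-8")
offsets = [offset for offset, _ in iter_chars(sample)]
euro_offset = byte_offset_of(sample, 3)
```

    Code that walks a string from start to end, which is
    the usual case, never needs byte_offset_of() at all.
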
    People think that UCS-2, UTF-16, or UTF-32 means we
    can have perfect random access to strings because a
    character is always represented as a single word or
    longword. Not so. True UCS-2 should work that way, but
    the term is often used (notably by Microsoft) to refer
    to UTF-16. UTF-16 uses a mechanism called "surrogates"
    whereby a single character may need two words to
    represent it. There goes your free array access. Even
    UTF-32 is not safe from this, because Unicode requires
    "combining characters". This means that "á" may be
    represented as "a" followed by a non-spacing "´" acute
    accent. Some people think this is also silly. These
    people need to go read all about Unicode before they
    embark on seriously multilingual software. Vietnamese
    is possible to support without combining characters,
    but you won't be able to view the results because no
    Vietnamese fonts exist that work this way - they all
    expect to use combining characters. Thai needs them.
    Hindi needs them. All Indian/Indic languages need
    them.
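
    Both effects are easy to check. The sketch below uses
    Python's standard codecs and unicodedata module. The
    helper names and the sample characters are my own
    choice for illustration.

```python
import unicodedata


def utf16_units(s: str) -> int:
    """Number of 16-bit words needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s: str) -> int:
    """Number of 32-bit longwords needed to store s as UTF-32."""
    return len(s.encode("utf-32-le")) // 4


# A character outside the 16-bit range: U+1D11E MUSICAL SYMBOL G CLEF.
g_clef = "\U0001D11E"
g_clef_utf16 = utf16_units(g_clef)   # stored as a surrogate pair
g_clef_utf32 = utf32_units(g_clef)

# "a with acute" as one precomposed code point, and as "a" + combining acute.
composed = "\u00e1"
decomposed = "a\u0301"
same_text = unicodedata.normalize("NFD", composed) == decomposed
decomposed_utf32 = utf32_units(decomposed)
```

    So one on-screen character can take two words in UTF-16
    and, with a combining accent, two longwords even in
    UTF-32.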

    So to sum up, the two arguments not to use UTF-8
    internally are:

    1) Array access is too slow.

    - This is not true and it is seldom needed.

    2) UTF-8 means you have to handle a series of values
       for a single on-screen character.

    - *All* Unicode encodings need this anyway!

    But look around the internet for better arguments and
    better-written arguments.

    Andrew Dunbar.

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com



