Re: commit: abi: UTF8String class

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sun Apr 21 2002 - 00:28:00 EDT

    --- phearbear <phearbear@home.se> wrote:
    > Andrew Dunbar wrote:
    >
    > > --- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
    > >
    > >>>Andrew Dunbar <hippietrail@yahoo.com> wrote:
    > >>>
    > >>>Well pretty soon we're going to need a real
    > >>>replacement. Dom and I are both in favour of the
    > >>>replacement being UTF-8 but some here seem to want
    > >>>UTF-32.
    > >>>
    > >>UTF-8 is an encoding scheme that is intended to allow
    > >>Unicode communication between separate processes over
    > >>8-bit channels. For that it is great, but that's about
    > >>the only thing it is really good for. UTF-8 processing
    > >>is cumbersome, and as such it is a completely
    > >>unsuitable format to use for the piecetable. We need a
    > >>fixed-width encoding for that, such as the current
    > >>UCS-2, i.e., UTF-32.
    > >>
    > >
    > >Please back up these comments. A lot of people, before
    > >they are familiar with Unicode and UTF-8, seem to think
    > >this. I did too. Then I read reams and reams of
    > >newsgroups and mailing lists and FAQs. Now I know why
    > >Qt, GTK, QNX, and others use UTF-8 internally.
    > >People seem to think that because UTF-8 encodes
    > >characters as variable-length runs of bytes, it is
    > >somehow computationally expensive to handle. Not so.
    > >You can use existing 8-bit string functions on it.
    > >It is backwards compatible with ASCII. You can scan
    > >forwards and backwards effortlessly. You can always
    > >tell which character in a sequence a given byte
    > >belongs to.
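
    A rough sketch of the point just above, in C++. It assumes
    the string holds well-formed UTF-8, and the helper names
    (is_continuation, char_start, next_char, prev_char,
    count_chars) are made up for illustration; they are not
    existing AbiWord functions. The idea is that every
    continuation byte has the bit pattern 10xxxxxx, so from any
    byte you can find the start of its character without
    rescanning from the beginning of the string:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx. In well-formed UTF-8 these only
// appear as the second or later byte of a multi-byte character.
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Index of the first byte of the character that byte i belongs to.
// Precondition (assumed, not checked beyond the assert): i < s.size().
std::size_t char_start(const std::string& s, std::size_t i) {
    assert(i < s.size());
    while (i > 0 && is_continuation(static_cast<unsigned char>(s[i]))) {
        --i;
    }
    return i;
}

// Index of the first byte of the character after the one starting at i
// (returns s.size() when there is no next character).
std::size_t next_char(const std::string& s, std::size_t i) {
    ++i;
    while (i < s.size() && is_continuation(static_cast<unsigned char>(s[i]))) {
        ++i;
    }
    return i;
}

// Index of the first byte of the character before position i.
// Precondition: 0 < i <= s.size().
std::size_t prev_char(const std::string& s, std::size_t i) {
    assert(i > 0);
    return char_start(s, i - 1);
}

// Number of characters (code points): every byte that is not a
// continuation byte starts exactly one character.
std::size_t count_chars(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) {
            ++n;
        }
    }
    return n;
}
```

    With the 7-byte string "a\xC3\xA1\xE2\x82\xACz" (a, á, €, z),
    char_start(s, 5) is 3, prev_char(s, 6) is 3, count_chars(s)
    is 4, and the plain 8-bit s.find('z') still returns 6,
    because an ASCII byte never occurs inside a multi-byte
    character.
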
    > >People think random access to these strings using the
    > >array operator will cost the earth. Guess what: very
    > >little code accesses strings as arrays, especially in
    > >a word processor. Of the code which does, very little
    > >needs to. Even when you do perform lots of array
    > >operations on a UTF-8 string, people have done
    > >extensive tests showing that the cost is negligible.
    > >Look in the Unicode literature and you will find all
    > >this information.
    > >People think that UCS-2, UTF-16, or UTF-32 mean we can
    > >have perfect random access to strings because a
    > >character is always represented as a single word or
    > >longword. Not so. UCS-2 should allow this, but the term
    > >is often (by Microsoft) used to refer to UTF-16. UTF-16
    > >uses a mechanism called "surrogates" whereby a single
    > >character may need two words to represent it. There
    > >goes your free array access. Even UTF-32 is not safe
    > >from this, because Unicode requires "combining
    > >characters". This means that "á" may be represented
    > >as "a" followed by a non-spacing "´" acute accent.
    > >Some people think this is also silly. These people
    > >need to go read all about Unicode before they embark
    > >on seriously multilingual software. Vietnamese is
    > >possible to support without combining characters, but
    > >you won't be able to view the results because no
    > >Vietnamese fonts exist that work this way; they all
    > >expect to use combining characters. Thai needs them.
    > >Hindi needs them. All Indian/Indic languages need
    > >them.
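
    To make the surrogate and combining-character point
    concrete, here is a small C++11 sketch. The function names
    are hypothetical, the UTF-16 input is assumed to be
    well-formed, and is_combining_diacritic only covers the
    Combining Diacritical Marks block (U+0300 to U+036F). Real
    software would need the full Unicode character data and
    segmentation rules, so treat this as an illustration only:

```cpp
#include <cstddef>
#include <string>

// Second half of a surrogate pair in UTF-16.
inline bool is_low_surrogate(char16_t u) {
    return u >= 0xDC00 && u <= 0xDFFF;
}

// Code points in a well-formed UTF-16 string: every 16-bit unit starts a
// code point except the low (trailing) half of a surrogate pair.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (!is_low_surrogate(u)) {
            ++n;
        }
    }
    return n;
}

// Simplification: only the Combining Diacritical Marks block,
// U+0300..U+036F. Many other combining characters exist.
inline bool is_combining_diacritic(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// On-screen characters in a UTF-32 string under the simplification above:
// a combining mark attaches to the preceding base character, so it does
// not start a new one.
std::size_t count_base_chars(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_diacritic(c)) {
            ++n;
        }
    }
    return n;
}
```

    For example, u"a\U0001D11Ez" holds three characters in four
    16-bit words, so s[2] is half a character. U"a\u0301" ("a"
    plus combining acute) is two 32-bit values but one
    on-screen character, the same character that U"\u00E1"
    stores as a single value.
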
    > >
    > >So to sum up, the two arguments not to use UTF-8
    > >internally are:
    > >
    > >1) Array access is too slow.
    > >
    > >- This is not true and it is seldom needed.
    > >
    > >2) UTF-8 means you have to handle a series of values
    > >   for a single on-screen character.
    > >
    > >- *All* Unicode encodings need this anyway!
    > >
    > >But look around the internet for better arguments and
    > >better written arguments.
    > >
    > >Andrew Dunbar.
    > >
    > >=====
    > >http://linguaphile.sourceforge.net
    > http://www.abisource.com
    > >
    > Hi
    >
    > Excuse my laziness, but scanning through all of
    > unicode.org isn't really what I like to spend my week
    > on ;) Any special articles you recommend us to read?

    Unfortunately I don't have my own computer or
    internet connection yet since I just got back from
    travelling the world and need a job. Hopefully some
    of the devs here besides me also know a lot about
    Unicode or are researching it. I probably have some
    bookmarks on my computer but it's on the other side of
    the country. Google or Google Groups searches for
    Unicode and another relevant term should turn up
    plenty of stuff.

    This is incredibly important now that we're past 1.0.
    I really hope my lack of ability to post exact
    references here doesn't mean we start implementing
    things before we know the issues thoroughly.

    Andrew Dunbar.

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com



