From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sat Apr 20 2002 - 23:35:52 EDT
--- Tomas Frydrych <tomas@frydrych.uklinux.net>
wrote:
> > Andrew Dunbar <hippietrail@yahoo.com> wrote:
>
> > Well pretty soon we're going to need a real
> > replacement. Dom and I are both in favour of the
> > replacement being UTF-8 but some here seem to want
> > UTF-32.
>
> UTF-8 is an encoding scheme that is intended to
> allow Unicode
> communication between separate processes over 8-bit
> channels.
> For that it is great, but that's about the only
> thing it is really good
> for. UTF-8 processing is cumbersome, and as such it
> is a completely
> unsuitable format to use for the piecetable. We need
> a fixed-width
> encoding for that, such as the current UCS-2, i.e.,
> UTF-32.
Please back up these comments. A lot of people,
before they are familiar with Unicode and UTF-8,
seem to think this. I did too. Then I read reams and reams of
newsgroups and mailing lists and FAQs. Now I know why
Qt, GTK, QNX, and others use UTF-8 internally.
People seem to think that because UTF-8 encodes
characters as variable-length runs of bytes, it is
somehow computationally expensive to handle. Not
so. You can use existing 8-bit string functions on
it.
It is backwards compatible with ASCII. You can scan
forwards and backwards effortlessly. You can always
tell which character in a sequence a given byte
belongs to.
People think random access to these strings using the
array operator will cost the earth. Guess what - very
little code accesses strings as arrays - especially in
a word processor. Of the code which does, very little
of it needs to. Even when you do perform lots of
array operations on a UTF-8 string, people have done
extensive tests showing that the cost is negligible -
look in the Unicode literature and you will find all
this information.
People think that UCS-2, UTF-16, or UTF-32 mean we can
have perfect random access to strings because a
character is always represented as a single word or
longword. Not so. UCS-2 would, but that term is
often used (notably by Microsoft) to refer to UTF-16.
UTF-16 uses a mechanism called "surrogates" whereby a
single character may need two words to represent it.
There goes your free array access. Even UTF-32 is not
safe, because Unicode requires "combining characters":
"á" may be represented as "a" followed by a
non-spacing "´" acute accent.
Some people think combining characters are silly too.
These people need to go read all about Unicode before
they embark on seriously multilingual software.
Vietnamese is possible to support without combining
characters, but you won't be able to view the results
because no Vietnamese fonts exist that work this way -
they all expect to use combining characters. Thai
needs them. Hindi needs them. All Indian/Indic
languages need them.
So to sum up, the two arguments not to use UTF-8
internally are:
1) Array access is too slow.
- This is not true and it is seldom needed.
2) UTF-8 means you have to handle a series of values
for a single on-screen character.
- *All* Unicode encodings need this anyway!
But look around the internet and you will find better
and better-written arguments than these.
Andrew Dunbar.
=====
http://linguaphile.sourceforge.net http://www.abisource.com
This archive was generated by hypermail 2.1.4 : Sat Apr 20 2002 - 23:36:56 EDT