From: phearbear (phearbear@home.se)
Date: Sun Apr 21 2002 - 00:13:01 EDT
Andrew Dunbar wrote:
> --- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
>
>>>Andrew Dunbar <hippietrail@yahoo.com> wrote:
>>>
>>>Well pretty soon we're going to need a real
>>>replacement. Dom and I are both in favour of the
>>>replacement being UTF-8 but some here seem to want
>>>UTF-32.
>>>
>>UTF-8 is an encoding scheme that is intended to allow Unicode
>>communication between separate processes over 8-bit channels.
>>For that it is great, but that's about the only thing it is really good
>>for. UTF-8 processing is cumbersome, and as such it is a completely
>>unsuitable format to use for the piecetable. We need a fixed-width
>>encoding for that, such as the current UCS-2, i.e., UTF-32.
>>
>
>Please back up these comments. A lot of people, before
>they are familiar with Unicode and UTF-8, seem to think
>this. I did too. Then I read reams and reams of
>newsgroups and mailing lists and FAQs. Now I know why
>Qt, GTK, QNX, and others use UTF-8 internally.
>People seem to think that because UTF-8 encodes
>characters as variable-length runs of bytes, this
>is somehow computationally expensive to handle. Not
>so. You can use existing 8-bit string functions on
>it.
>It is backwards compatible with ASCII. You can scan
>forwards and backwards effortlessly. You can always
>tell which character in a sequence a given byte
>belongs to.
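
A minimal sketch of the scanning property described above, assuming Python 3. The helper names (is_continuation, char_start, next_char) are made up for illustration and do not come from any library or from the AbiWord source. It relies only on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so the start of a character can be found from any byte offset.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which always look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(buf: bytes, i: int) -> int:
    """Scan backwards from byte offset i to the first byte of its character."""
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


def next_char(buf: bytes, i: int) -> int:
    """Scan forwards from a character start to the start of the next one."""
    i += 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i


# "a", e-acute and the euro sign: 1-byte, 2-byte and 3-byte characters.
buf = "a\u00e9\u20ac".encode("utf-8")

# Walk the buffer one character at a time and record where each one starts.
starts = []
pos = 0
while pos < len(buf):
    starts.append(pos)
    pos = next_char(buf, pos)
```

Here starts ends up as [0, 1, 3], and char_start(buf, 5) returns 3 because byte 5 is the last byte of the euro sign. Ordinary byte search also works unchanged: buf.find("\u20ac".encode("utf-8")) returns 3.
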
>People think random access to these strings using the
>array operator will cost the earth. Guess what - very
>little code accesses strings as arrays - especially in
>a word processor. Of the code which does, very little
>of that needs to. Even when you do perform lots of
>array operations on a UTF-8 string, people have done
>extensive tests showing that the cost is
>negligible - look in the Unicode literature and you
>will find all this information.
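
For reference, "array access" on a UTF-8 string amounts to something like the sketch below, assuming Python 3. The function name byte_offset_of_char is hypothetical. It is a plain linear scan that counts character starts; it does not measure anything, so whether the cost is negligible in practice remains the claim made above, not something this code demonstrates.

```python
def byte_offset_of_char(buf: bytes, n: int) -> int:
    """Return the byte offset at which the n-th character (0-based) starts."""
    count = 0
    for offset, byte in enumerate(buf):
        # Any byte that is not 10xxxxxx begins a new character.
        if (byte & 0xC0) != 0x80:
            if count == n:
                return offset
            count += 1
    raise IndexError("character index out of range")


# "a", e-acute, euro sign, "z": characters start at bytes 0, 1, 3 and 6.
sample = "a\u00e9\u20acz".encode("utf-8")
offsets = [byte_offset_of_char(sample, n) for n in range(4)]
```
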
>People think that UCS-2, UTF-16, or UTF-32 mean we can
>have perfect random access to strings because a
>character is always represented as a single word or
>longword. Not so. UCS-2 should give you that, but this
>term is often (by Microsoft) used to refer to UTF-16. UTF-16
>uses a mechanism called "surrogates" whereby a single
>character may need two words to represent it. There
>goes your free array access. Even UTF-32 is not safe
>from this, because Unicode requires "combining
>characters". This means that "á" may be represented
>as "a" followed by a non-spacing "´" acute accent.
>Some people think this is also silly. These people
>need to go read all about Unicode before they embark
>on seriously multilingual software. Vietnamese is
>possible to support without combining characters but
>you won't be able to view the results because no
>Vietnamese fonts exist that work this way - they all
>expect to use combining characters. Thai needs them.
>Hindi needs them. All Indian/Indic languages need
>them.
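
The two points above (surrogates in UTF-16, combining characters in every encoding) can be checked with a short sketch, assuming Python 3 and its standard unicodedata module. The function count_display_chars is a made-up name and only a rough approximation: it skips code points with a non-zero combining class, which is not full grapheme-cluster segmentation.

```python
import unicodedata

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"
utf16_units = len(clef.encode("utf-16-le")) // 2  # number of 16-bit words
utf32_units = len(clef.encode("utf-32-le")) // 4  # number of 32-bit words

# a-acute as one precomposed code point versus "a" + U+0301 COMBINING ACUTE.
precomposed = "\u00e1"
decomposed = "a\u0301"
decomposed_utf32_units = len(decomposed.encode("utf-32-le")) // 4


def count_display_chars(s: str) -> int:
    """Rough count of on-screen characters: skip combining marks."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# NFC normalization composes the pair back into the single code point.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

The clef takes two 16-bit words in UTF-16 but one in UTF-32, while the decomposed a-acute takes two 32-bit words even in UTF-32 although it is one character on screen.
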
>
>So to sum up, the two arguments not to use UTF-8
>internally are:
>
>1) Array access is too slow.
>
>- This is not true and it is seldom needed.
>
>2) UTF-8 means you have to handle a series of values
> for a single on-screen character.
>
>- *All* Unicode encodings need this anyway!
>
>But look around the internet for better arguments and
>better written arguments.
>
>Andrew Dunbar.
>
>=====
>http://linguaphile.sourceforge.net http://www.abisource.com
>
Hi
Excuse my laziness, but scanning through all of unicode.org isn't really
what I'd like to spend my week on ;) Any particular articles you'd recommend
we read?
/Johan
This archive was generated by hypermail 2.1.4 : Sun Apr 21 2002 - 00:15:21 EDT