From: Martin Sevior (msevior@mccubbin.ph.unimelb.edu.au)
Date: Sun Apr 21 2002 - 08:37:58 EDT
On Sun, 21 Apr 2002, Andrew Dunbar wrote:
> --- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
> > > Andrew Dunbar <hippietrail@yahoo.com> wrote:
> >
> > > Well pretty soon we're going to need a real
> > > replacement. Dom and I are both in favour of the
> > > replacement being UTF-8 but some here seem to want
> > > UTF-32.
> >
> > UTF-8 is an encoding scheme that is intended to allow Unicode
> > communication between separate processes over 8-bit channels.
> > For that it is great, but that's about the only thing it is really
> > good for. UTF-8 processing is cumbersome, and as such it is a
> > completely unsuitable format to use for the piecetable. We need a
> > fixed-width encoding for that, such as the current UCS-2 or UTF-32.
>
> Please back up these comments. A lot of people, before they are
> familiar with Unicode and UTF-8, seem to think this. I did too. Then
> I read reams and reams of newsgroups and mailing lists and FAQs. Now
> I know why Qt, GTK, QNX, and others use UTF-8 internally.
>
> People seem to think that because UTF-8 encodes characters as
> variable-length runs of bytes, it is somehow computationally
> expensive to handle. Not so. You can use existing 8-bit string
> functions on it. It is backwards compatible with ASCII. You can scan
> forwards and backwards effortlessly. You can always tell which
> character in a sequence a given byte belongs to.
>
> People think random access to these strings using the array operator
> will cost the earth. Guess what - very little code accesses strings
> as arrays - especially in a word processor. Of the code which does,
> very little of that needs to. Even when you do perform lots of array
> operations on a UTF-8 string, people have done extensive tests
> showing that the cost is negligible - look in the Unicode literature
> and you will find all this information.
>
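The backward-scanning claim above rests on one property of UTF-8: every continuation byte has the bit pattern 10xxxxxx, and no lead byte does. A minimal sketch of that property follows, in Python purely for brevity. The helper names (is_continuation, char_start, count_chars) are made up for illustration and are not AbiWord or library functions.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes always look like 10xxxxxx.
    return (byte & 0xC0) == 0x80


def char_start(buf: bytes, i: int) -> int:
    """Index of the lead byte of the character that contains buf[i]."""
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


def count_chars(buf: bytes) -> int:
    """Number of code points: count every byte that is not a continuation."""
    return sum(1 for b in buf if not is_continuation(b))


# "a" (1 byte), e-acute (2 bytes), euro sign (3 bytes): 6 bytes in total.
text = "a\u00e9\u20ac".encode("utf-8")

# Any byte offset can be mapped back to the start of its character.
starts = [char_start(text, i) for i in range(len(text))]

# Ordinary byte-oriented search works unchanged on UTF-8 data.
euro_offset = text.find("\u20ac".encode("utf-8"))
```

Here starts groups the six byte offsets by the character they belong to, and the plain bytes.find call locates the euro sign without any decoding step.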
> People think that UCS-2, UTF-16, or UTF-32 mean we can have perfect
> random access to strings because a character is always represented
> as a single word or longword. Not so. UCS-2 should, but this term is
> often (by Microsoft) used to refer to UTF-16. UTF-16 uses a
> mechanism called "surrogates" whereby a single character may need
> two words to represent it. There goes your free array access. Even
> UTF-32 is not safe from this, because Unicode requires "combining
> characters". This means that "á" may be represented as "a" followed
> by a non-spacing "´" acute accent. Some people think this is also
> silly. These people need to go read all about Unicode before they
> embark on seriously multilingual software. Vietnamese can be
> supported without combining characters, but you won't be able to
> view the results because no Vietnamese fonts exist that work this
> way - they all expect to use combining characters. Thai needs them.
> Hindi needs them. All Indian/Indic languages need them.
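Both effects described above (surrogate pairs in UTF-16, combining sequences in every encoding) are easy to check with a few lines of standard-library Python. This is only an illustrative sketch; the helper names utf16_units and utf32_units are invented here, and U+10330 (a Gothic letter) was picked arbitrarily as a character outside the Basic Multilingual Plane.

```python
import unicodedata


def utf16_units(s: str) -> int:
    # Number of 16-bit words needed to store s as UTF-16.
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s: str) -> int:
    # Number of 32-bit words needed to store s as UTF-32.
    return len(s.encode("utf-32-le")) // 4


gothic = "\U00010330"    # one character outside the BMP
composed = "\u00e1"      # a-acute as a single precomposed code point
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT

# One character, but two 16-bit words (a surrogate pair) in UTF-16.
gothic_words_16 = utf16_units(gothic)

# Even in UTF-32 the decomposed form takes two array slots
# for what is a single on-screen character.
decomposed_words_32 = utf32_units(decomposed)

# Both spellings are canonically equivalent: NFC normalisation
# maps the combining sequence to the precomposed character.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

So a fixed-width array gives one slot per code point, not one slot per visible character.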
>
> So to sum up, the two arguments not to use UTF-8
> internally are:
>
> 1) Array access is too slow.
>
> - This is not true and it is seldom needed.
>
> 2) UTF-8 means you have to handle a series of values
> for a single on-screen character.
>
> - *All* Unicode encodings need this anyway!
>
> But look around the internet for better arguments and
> better written arguments.
>
UTF-8 is great for communicating between the piecetable and the widgets. I
think we should definitely do this. What I don't want is for us to store
our text as UTF-8 in the piecetable. We have a *LOT* of code that expects
that every position in the piecetable corresponds to one letter of text.

What I think we should do is store our Unicode as UT_uint32 in the
piecetable, which can then be randomly accessed the same way we do things
now.

We just need to make a global UT_UCSChar => UT_UCSChar32 (==
UT_uint32) change, plus some hardwired fixes for UT_uint16 about the place,
and some routines to do UT_UCSChar32 => UTF-8 conversion for transporting
Unicode to the screen.
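
As a rough illustration of the proposal above (a buffer of 32-bit code points that is indexed directly, converted to UTF-8 only on the way out), here is a small Python sketch. It is not AbiWord code: the UT_* names in this thread are C++ typedefs, and ucs4_to_utf8 and buffer_to_utf8 are hypothetical helper names used only for this example.

```python
def ucs4_to_utf8(cp: int) -> bytes:
    """Encode one 32-bit code point as a UTF-8 byte sequence."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def buffer_to_utf8(buf: list) -> bytes:
    """Convert a whole buffer of code points for output."""
    return b"".join(ucs4_to_utf8(cp) for cp in buf)


# Stand-in for a piecetable buffer: one integer per code point,
# so position N is always the Nth code point.
buffer = [ord(c) for c in "na\u00efve \u20ac"]

third = buffer[2]               # direct random access by position
wire = buffer_to_utf8(buffer)   # UTF-8 only at the boundary
```

The internal positions stay one-per-code-point, and the variable-length encoding appears only in wire.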

By the way, glib 2.0 (which we need for Pango) has a Unicode character type
(gunichar) which is just the same as UT_uint32, and plenty of UT_uint32 <=>
UTF-8 conversion and convenience routines.

Cheers
Martin