Re: commit: abi: UTF8String class

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sun Apr 21 2002 - 22:07:21 EDT


     --- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
    > > Andrew Dunbar wrote:
    > > People seem to think that because UTF-8 encodes
    > > characters as variable length runs of bytes that this
    > > is somehow computationally expensive to handle. Not
    > > so. You can use existing 8-bit string functions on it.
    > Not really; you can use functions like strcpy to copy
    > utf-8 strings, but you cannot use functions like strlen
    > to find out how many characters there are in the
    > string. At the point you

    You can't do this with the widechar version of strlen
    for Unicode either. This will only tell you the
    number of codepoints. There are at least four ways
    to interpret "length" for a Unicode string:
    1) amount of memory.
    2) number of codepoints.
    3) number of on-screen character positions.
    4) length in pixels of rendered strings.

    You need to think about this before choosing an
    encoding. What seems the simple way usually turns out
    not to be sufficient for proper Unicode handling.
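    To make the four notions of "length" concrete, here is a
    minimal sketch in C. The helper utf8_codepoints is a
    hypothetical name of mine, not an AbiWord function; it
    counts codepoints by skipping UTF-8 continuation bytes
    (those of the form 10xxxxxx). Note that byte count,
    codepoint count, and on-screen character count all
    disagree for a combining sequence:

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: count UTF-8 codepoints by skipping
       continuation bytes (10xxxxxx). */
    static size_t utf8_codepoints(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void)
    {
        /* "á" encoded as 'a' + U+0301 COMBINING ACUTE ACCENT */
        const char *s = "a\xCC\x81";
        printf("bytes=%zu codepoints=%zu\n",
               strlen(s), utf8_codepoints(s));
        /* bytes=3, codepoints=2, yet one on-screen character
           position -- and the pixel width depends on the font. */
        return 0;
    }
    ```

    The same gap exists for wide strings: wcslen on L"a"
    followed by a combining acute also reports 2, so a
    fixed-width encoding does not buy you interpretation 3
    or 4 either.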

    > start dealing with characters rather than bytes, you
    > need extra processing and new utf-8 specific functions
    > -- apart from copying the string the standard C library
    > is of little use. The extra processing may not be huge
    > for a single operation, but it is there. There is
    > nothing much you are getting in return for it; you will
    > save some memory for some users, but because the memory
    > requirements are non-linear across the Unicode space,
    > you also end up penalizing other users.

    The main thing we gain is that we won't be locked into a
    design that bakes in wrong assumptions about codepoints
    vs. characters vs. glyphs etc. - if we make as few
    assumptions as possible, it's easy to fix things when we
    discover something tricky we missed at the design stage.

    > > It is backwards compatible with ASCII.
    > What is the value of that for a Unicode-based
    > wordprocessor?
    >
    > > You can scan forwards and backwards effortlessly.
    > > You can always tell which character in a sequence a
    > > given byte belongs to.
    > You have to _examine_ the byte to be able to do that,
    > that costs time.

    You have to do it anyway for combining characters!
    I can assure you the amount of time used here is not
    going to impact AbiWord's performance on anything
    faster than a Z-80. This is microoptimization. There
    must be a million places in AbiWord where we can
    already save a handful of clock cycles if we're that
    worried about it.
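    The "examine the byte" cost is a single mask-and-compare
    per byte. As a rough illustration (utf8_prev is my own
    hypothetical helper, not AbiWord code), stepping backwards
    to the start of the previous codepoint is just skipping
    continuation bytes:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical helper: step back from byte index i to the
       start of the previous codepoint by skipping continuation
       bytes (10xxxxxx). */
    static size_t utf8_prev(const char *s, size_t i)
    {
        while (i > 0 && ((unsigned char)s[--i] & 0xC0) == 0x80)
            ;
        return i;
    }

    int main(void)
    {
        const char *s = "a\xC3\xA9z";   /* "aéz": é is 2 bytes */
        assert(utf8_prev(s, 4) == 3);   /* back over 'z'       */
        assert(utf8_prev(s, 3) == 1);   /* back over 2-byte é  */
        assert(utf8_prev(s, 1) == 0);   /* back over 'a'       */
        return 0;
    }
    ```

    One AND and one compare per byte is the whole "extra
    processing" for backward iteration; the loop never has to
    rescan from the start of the string.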

    > > People think random access to these strings using the
    > > array operator will cost the earth. Guess what - very
    > > little code accesses strings as arrays - especially
    > > in a Word Processor. Of the code which does, very
    > > little of that needs to. Even when you do perform
    > > lots of array operations on a UTF-8 string, people
    > > have done extensive tests showing that the cost is
    > > extremely negligible
    > Anybody actually designed a wordprocessor that uses
    > utf-8 internally? It is one thing to use utf-8 with a
    > widget library, and another thing to use it in a
    > wordprocessor where your strings are dynamic, ever
    > changing, and where you need to iterate through them
    > all the time. I want the piecetable and layout-engine
    > operations to be optimised for speed, and so would
    > like to use an encoding that requires the least amount
    > of processing; utf-8 is not that encoding.

    I'm sure there are many ways to design classes here
    that can make intelligent use of whichever encoding.
    But I want to hear Dom's opinion on this one.
    Again, though, I think it's already more complex than
    you think due to combining (and zero-width) characters.
    But maybe not, because as I've said I don't really
    know much about the piecetable.

    > > People think that UCS-2, UTF-16, or UTF-32 mean we
    > > can have perfect random access to strings because a
    > > character is always represented as a single word or
    > > longword ... Unicode requires "combining
    > > characters". This means that "á" may be represented
    > > as "a" followed by a non-spacing "´" acute accent.
    > I think you are confusing two different issues here.
    > In utf-32 you may need several codepoints to represent
    > what a user might deem to be a "character" but the
    > codepoints per se are of fixed width. In utf-8, on the
    > other hand, the codepoints themselves are represented
    > by byte sequences of variable width. The issue to me
    > seems to be, how the combining characters are input.
    > If the user inputs them as a series of separate
    > keystrokes, then we can treat them as separate
    > "characters" for purposes of navigation and deletion.
    > This requires no real changes to AW, the

    Well not exactly. I happen to know that at least
    Arabic keyboards under certain OSes (maybe all but
    I'm not sure) will return a series of two keycodes for
    a single keypress when that key represents a ligature
    of two characters. I'm pretty sure a Devanagari
    keyboard could do the same or worse. When you hit
    such a key on an Arabic computer it takes, rightly,
    two backspaces to delete it.
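    The Arabic-keyboard behaviour above has a direct analogue
    in a fixed-width encoding: one user-perceived character
    can still be several code units, so one "delete" may need
    several steps. A minimal sketch with C11 char32_t
    (i.e. UTF-32, the fixed-width case Tomas describes):

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <uchar.h>   /* char32_t, C11 */

    int main(void)
    {
        /* "á" as 'a' + U+0301 COMBINING ACUTE ACCENT: even in
           fixed-width UTF-32 this is two code units, so
           navigation and deletion must still handle
           multi-unit "characters" -- just like the two-keycode
           ligature keys described above. */
        const char32_t s[] = U"a\u0301";
        size_t n = 0;
        while (s[n])
            n++;
        assert(n == 2);  /* two codepoints, one visible character */
        return 0;
    }
    ```

    So fixed-width codepoints remove the byte-level
    variability but not the codepoint-level variability of
    user-perceived characters.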

    > bidi build already handles overstriking characters.
    > If, on the other hand, the user presses a single key
    > and it generates a sequence of Unicode codepoints,
    > then this will, obviously, need some changes. And I
    > agree that we need to discuss how to deal with it,
    > but that is something quite different from the
    > problems caused by the variable width of the actual
    > codepoints introduced by utf-8.

    It could be. I really wish I knew more about the
    piecetable.

    Andrew Dunbar.

    > Tomas

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com




    This archive was generated by hypermail 2.1.4 : Sun Apr 21 2002 - 22:08:27 EDT