Re: commit: abi: UTF8String class

From: Tomas Frydrych (tomas@frydrych.uklinux.net)
Date: Sun Apr 21 2002 - 12:44:43 EDT


    > Andrew Dunbar wrote:
    > People seem to think that because UTF-8 encodes
    > characters as variable length runs of bytes that this
    > is somehow computationally expensive to handle. Not
    > so. You can use existing 8-bit string functions on
    > it.
    Not really; you can use functions like strcpy to copy utf-8 strings,
    but you cannot use functions like strlen to find out how many
    characters there are in the string. At the point you start dealing
    with characters rather than bytes, you need extra processing and
    new utf-8-specific functions -- apart from copying the string, the
    standard C library is of little use. The extra processing may not be
    huge for a single operation, but it is there, and there is not much
    you get in return for it: you will save some memory for some
    users, but because the memory requirements are non-linear
    across the Unicode space, you also end up penalizing other users.
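    To put the strlen point concretely, here is a rough sketch (not the
    actual UTF8String code) of what counting the characters in a utf-8
    string involves -- every byte has to be tested so that the 10xxxxxx
    continuation bytes can be skipped:

        #include <cstddef>

        size_t utf8_codepoints(const char * s)
        {
            size_t count = 0;
            for (; *s; ++s)
                if (((unsigned char)*s & 0xC0) != 0x80)  // not a continuation byte
                    ++count;
            return count;
        }

        // strlen() gives the byte length; utf8_codepoints() gives the
        // character count. The two diverge as soon as the text leaves ASCII.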

    > It is backwards compatible with ASCII.
    What is the value of that for a Unicode-based wordprocessor?

    > You can scan forwards and backwards effortlessly. You can always
    > tell which character in a sequence a given byte belongs to.
    You have to _examine_ the byte to be able to do that, and that
    costs time.
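    For instance, stepping back by one character (a sketch, assuming a
    well-formed utf-8 buffer) means testing bytes until a lead byte is
    found, where a fixed-width encoding needs only a single decrement:

        // Return a pointer to the start of the codepoint preceding p.
        const char * utf8_prev(const char * start, const char * p)
        {
            do {
                --p;
            } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
            return p;
        }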

    > People think random access to these strings using
    > array operator will cost the earth. Guess what - very
    > little code access strings as arrays - especially in
    > a Word Processor. Of the code which does, very little
    > of that needs to. Even when you do perform lots of
    > array operations on a UTF-8 string, people have done
    > extensive tests showing that the cost is extremely
    > negligible
    Has anybody actually designed a wordprocessor that uses utf-8
    internally? It is one thing to use utf-8 with a widget library, and
    another thing to use it in a wordprocessor, where your strings are
    dynamic, ever changing, and where you need to iterate through
    them all the time. I want the piecetable and layout-engine
    operations to be optimised for speed, and so would like to use an
    encoding that requires the least amount of processing; utf-8 is not
    that encoding.
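    The cost I am worried about is of this kind (a hypothetical
    comparison, not code from the piecetable): reaching the n-th
    codepoint in a fixed-width (ucs-4) buffer is plain pointer
    arithmetic, while in utf-8 it is a scan over all the preceding bytes:

        #include <cstdint>
        #include <cstddef>

        // Fixed width: O(1) random access.
        uint32_t nth_ucs4(const uint32_t * buf, size_t n)
        {
            return buf[n];
        }

        // utf-8: O(n) walk, skipping continuation bytes as we go.
        const char * nth_utf8(const char * s, size_t n)
        {
            while (n && *s)
            {
                ++s;
                while (((unsigned char)*s & 0xC0) == 0x80)
                    ++s;
                --n;
            }
            return s;
        }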

    > People think that UCS-2, UTF-16, or UTF-32 mean we can
    > have perfect random access to strings because a
    > characters is always represented as a single word or
    > longword ... Unicode requires "combining
    > characters". This means that "á" may be represented
    > as "a" followed by a non-spacing "´" acute accent.
    I think you are confusing two different issues here. In utf-32 you
    may need several codepoints to represent what a user might deem
    to be a "character", but the codepoints per se are of fixed width.
    In utf-8, on the other hand, the codepoints themselves are
    represented by byte sequences of variable width. The issue to me
    seems to be how the combining characters are input. If the user
    inputs them as a series of separate keystrokes, then we can treat
    them as separate "characters" for purposes of navigation and
    deletion; this requires no real changes to AW, since the bidi build
    already handles overstriking characters. If, on the other hand, the
    user presses a single key and it generates a sequence of Unicode
    codepoints, then this will, obviously, need some changes. I agree
    that we need to discuss how to deal with that, but it is something
    quite different from the problems caused by the variable width of
    the actual codepoints introduced by utf-8.
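    To illustrate the distinction (the codepoint values are from the
    Unicode charts; the array layout, not any particular API, is the
    point):

        #include <cstdint>

        // Precomposed "á": one codepoint -- always one 32-bit unit in
        // utf-32, but two bytes (0xC3 0xA1) in utf-8.
        const uint32_t precomposed[] = { 0x00E1 };

        // Decomposed "á": "a" plus a combining acute accent -- two
        // codepoints in *any* encoding. That is the combining-character
        // issue, and it is orthogonal to how many storage units each
        // individual codepoint occupies.
        const uint32_t decomposed[]  = { 0x0061, 0x0301 };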

    Tomas
