Re: commit: abi: UTF8String class

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sat Apr 20 2002 - 08:19:25 EDT


     --- F J Franklin <F.J.Franklin@sheffield.ac.uk> wrote:
    > > wrote:
    > o new UTF8String class (untested)
    > >
    > > If this is part of the new unicodization to support
    > > full-unicode, there's some stuff we need to discuss.
    >
    > Wasn't intended as such. phearbear says QNX wants to use UTF-8
    > whereas Abi uses UCS-2 and I decided to write the UTF8String
    > class to facilitate the conversion. Strings are stored
    > internally as UTF-8 byte sequences, and there is a home-made
    > iterator for accessing the string sequence by sequence; and a
    > fn. for converting current sequence to UCS-4.
    >
    > Currently conversion to UTF-8 is only from UCS-2, but conversion
    > from UCS-4 would be a trivial change. (I'm assuming that UCS-2
    > is the first 65536 codes of UCS-4 - is this correct?)

    Well, not exactly. There is plenty of hazy stuff in Unicode,
    unfortunately, and this is the reason why I don't think it's a
    good idea for us to rush into the new way of doing things. This
    is one of the hazy areas and I'll attempt to explain it, but
    you're better off reading all the documentation you can find at
    http://www.unicode.org and reading through a few mailing list
    archives that deal with Unicode issues.

    UCS-2 is a 16-bit encoding which supports the old 16-bit Unicode
    and as such is what you suggest.
    UCS-4 seems to be an exact synonym for UTF-32 but you'd better
    check!
    UTF-16 is an encoding which allows code points beyond the 16-bit
    range to be represented as a series of one or two 16-bit fields.
    When two fields are needed, each is called a "surrogate".
    UTF-32 is a 32-bit encoding where a character code is encoded in
    a single 32-bit field. Not all values are legal, however.
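
    To make the surrogate mechanism concrete, here is a minimal
    Python sketch of how one code point maps to UTF-16 fields. The
    function name utf16_code_units is made up for illustration and
    is not part of AbiWord or of Frank's UTF8String class.

```python
def utf16_code_units(code_point):
    """Return the 16-bit UTF-16 fields (as ints) for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        # This range is reserved for surrogates and is not a character.
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        # Fits in a single field; identical to the UCS-2 value.
        return [code_point]
    # Otherwise split the remaining 20 bits across two surrogates.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    for cp in (0x0041, 0x20AC, 0x1D11E):
        units = utf16_code_units(cp)
        print("U+%04X ->" % cp, " ".join("%04X" % u for u in units))
```

    Anything below U+10000 comes out the same as in UCS-2, which is
    why code that only ever sees such text cannot tell the two
    encodings apart.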

    UTF-16 vs. UCS-2: Unicode was adopted early by Microsoft for
    Windows NT, and by Java. Both chose to use UCS-2. This was back
    when everybody thought 16 bits would be plenty. Unicode has
    since been extended well beyond 16 bits.
    Windows 2000 and up seem to refer to their encoding simply as
    "Unicode" but it behaves as either UCS-2 or UTF-16 depending on
    a registry setting! I'm not sure what the behaviour of Windows
    XP is.
    I'm not sure whether Java now uses UTF-16 or not and, if so,
    whether they still use the term UCS-2.

    My rule of thumb: any encoding starting with "UCS" is to be
    considered deprecated. Use UCS encodings and UCS encoding names
    only when specifically dealing with a UCS encoding, for
    instance when converting to old Windows NT filenames or GUI
    strings. Do not *ever* say "UCS-*" when you mean "UTF-*". People
    are already confused over this and we, as developers of a
    multilingual word processor, need to have this very well
    understood. (Same goes for saying ASCII when you mean
    ISO-8859-1 or even ISO-8859-*.)

    Please read up on this since I'm not fully up to date
    because of my months on the road and not currently
    owning a machine or having an internet connection.

    > As a string class it's not nearly as functional as the others,
    > but it's not really intended as a replacement.

    Well pretty soon we're going to need a real
    replacement. Dom and I are both in favour of the
    replacement being UTF-8 but some here seem to want
    UTF-32.
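
    For anyone weighing the two, this is roughly what walking a
    UTF-8 string "sequence by sequence" and converting each sequence
    to its UCS-4 value involves, along with the reverse conversion.
    It is only a sketch in Python: the function names are invented
    here, it is not Frank's UTF8String code, and the validation is
    deliberately simplified (overlong forms and encoded surrogates
    are not all rejected when decoding).

```python
def utf8_sequence_length(lead_byte):
    """Length in bytes of the UTF-8 sequence starting with lead_byte."""
    if lead_byte < 0x80:
        return 1
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead_byte)


def iter_utf8_sequences(data):
    """Yield each UTF-8 sequence in data (a bytes object) in order."""
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        seq = data[i:i + n]
        if len(seq) != n:
            raise ValueError("truncated sequence at offset %d" % i)
        yield seq
        i += n


def sequence_to_code_point(seq):
    """Decode one UTF-8 sequence to its code point (the UCS-4 value)."""
    if len(seq) == 1:
        return seq[0]
    # Mask off the length marker bits in the lead byte.
    cp = seq[0] & (0xFF >> (len(seq) + 1))
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte: 0x%02X" % b)
        cp = (cp << 6) | (b & 0x3F)
    return cp


def code_point_to_utf8(cp):
    """Encode one code point as UTF-8 bytes."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

    With UTF-32 the iteration step disappears, since every code
    point is one fixed-size field; the cost is four bytes per
    character even for plain ASCII text.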

    > > We need to design the system so that a string is not built
    > > from a series of UTF-8 (or UTF-32) characters directly, but a
    > > series of "composed character" which in turn are a series of
    > > UTF-8 characters, the first being the main character, the
    > > remainder being zero-width modifiers. We need this to support
    > > proper internationalization. We probably need much discussion
    > > first actually.
    >
    > Not sure I understand this. Can you explain how to use
    > zero-width modifiers?

    They're also called "combining characters", such as the acute
    accent or the umlaut (really a dieresis). Instead of
    representing "Á" as U+00C1, it can be represented as U+0041
    U+0301. Currently this half-works in AbiWord if you have
    TrueType fonts (or on Windows) and if you turn off the
    RemapGlyphs hack in your profile.
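
    A rough sketch of the "composed character" idea, in Python:
    group each base character with the combining marks that follow
    it. The function name composed_characters is hypothetical, and
    treating every character in a Unicode "Mark" category as a
    modifier is a simplifying assumption; proper segmentation has
    more rules than this.

```python
import unicodedata


def composed_characters(text):
    """Split text into base characters plus their trailing marks."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            # Attach the combining mark to the preceding base character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    # "A" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "b".
    for cluster in composed_characters("A\u0301b"):
        print(" ".join("U+%04X" % ord(c) for c in cluster))
```

    A cursor, a delete key or a character count would then step over
    whole groups rather than individual code points.
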
    If you think this is a dumb idea then you haven't read enough
    about Unicode, so go read up (not you fjf, but all of the Abi
    developers). Unicode is not the only standard that uses such
    characters, by the way. The standard Vietnamese encodings all
    use this feature. Vietnamese fonts which include all
    combinations of letter+accent+tone mark are very rare but those
    with "combining characters" are quite common.
    As for southeast Asian and Indic languages, I don't believe
    Unicode even bothers to include all the myriad combinations of
    letter+vowel mark+funky language feature. Combining characters
    are generally considered to be a good thing, and the way
    forward. They will make searching, sorting, capitalization and
    maybe more much simpler even for Western languages.
    Once we understand these issues we then have to look into
    "Unicode normalization"...
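
    As a small illustration of why normalization matters, the two
    spellings of "Á" mentioned above are different code point
    sequences and compare unequal until they are normalized. This
    sketch uses Python's standard unicodedata module; the helper
    name same_text is made up for the example.

```python
import unicodedata

precomposed = "\u00c1"    # LATIN CAPITAL LETTER A WITH ACUTE
decomposed = "A\u0301"    # "A" + COMBINING ACUTE ACCENT


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(precomposed == decomposed)            # raw comparison
    print(same_text(precomposed, decomposed))   # normalized comparison
    # NFD goes the other way and splits the precomposed form.
    nfd = unicodedata.normalize("NFD", precomposed)
    print(" ".join("U+%04X" % ord(c) for c in nfd))
```

    Searching and sorting need the same treatment: pick one
    normalization form and apply it before comparing.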

    > Frank

    Hope this helps, and I hope people other than just
    Frank read it. Let's do Unicode properly and be the
    best Word Processor for Vietnamese and Thai on any
    platform! (:

    Andrew Dunbar.

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com




    This archive was generated by hypermail 2.1.4 : Sat Apr 20 2002 - 08:20:25 EDT