Re: commit: abi: UTF8String class

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sun Apr 21 2002 - 09:59:10 EDT

    --- F J Franklin <F.J.Franklin@sheffield.ac.uk> wrote:
    > I think Java uses UTF-16.

    I bet that older versions of Java used UCS-2 and newer
    versions use UTF-16 since they're mostly compatible,
    but that's just a guess.

    > UTF-32 is a subset of UCS-4, I believe. My
    > impression is that UTF-8 and
    > UCS-4 have less to do with UNICODE than UTF-16 and
    > UTF-32.

    I still don't know exactly how all of them relate.
    This is an annoyance of Unicode - many things at least
    seem muddy to those of us on the outside.

    UCS is "Universal character set", which basically
    means the mapping of a number onto a character. As
    such UCS-2 *should* mean the old 16-bit range of
    Unicode characters and UCS-4 *should* mean the new
    (up to) 32-bit range of Unicode characters. UCS does
    not specify how this number is encoded into various
    sequences of bits/bytes/words/etc.
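
    To make the "number onto a character" idea concrete, here is a
    small Python sketch. It is only an illustration of mine, not
    anything from the AbiWord tree, and the helper names
    code_point_label and fits_in_16_bits are made up for the example.
    It shows the UCS number of a few characters and whether each one
    fits in the old 16-bit range, without saying anything yet about
    how the number is stored.

```python
import unicodedata

def code_point_label(ch):
    """Return the conventional U+XXXX label for a single character."""
    return "U+%04X" % ord(ch)

def fits_in_16_bits(ch):
    """True if the character's number lies in the old 16-bit range."""
    return ord(ch) <= 0xFFFF

# A Latin letter, an accented letter, the euro sign, and a character
# whose number lies above the 16-bit range.
samples = ["A", "\u00e9", "\u20ac", "\U00010400"]

for ch in samples:
    print(code_point_label(ch), unicodedata.name(ch), fits_in_16_bits(ch))
```

    The last sample is the kind of character that only the wider
    range can number at all.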

    UTF is "Unicode Transformation Format", which basically
    means the algorithm used to encode the UCS number of
    each character into one or more units. UTF-8
    encodes any UCS character index into one or more 8-bit
    units. UTF-16 encodes any UCS character index into one
    or more 16-bit units. UTF-32 encodes any UCS
    character index into exactly one 32-bit unit.
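
    A rough Python sketch of the difference, again just my own
    illustration: the helpers code_units and describe are
    hypothetical names, and I am leaning on Python's built-in codecs
    rather than writing the transformations by hand. It encodes one
    character three ways and splits the result into 8-bit, 16-bit and
    32-bit units.

```python
import struct

def code_units(ch, encoding, unit_size):
    """Encode one character and split the bytes into big-endian units."""
    data = ch.encode(encoding)
    fmt = {1: "B", 2: "H", 4: "I"}[unit_size]
    count = len(data) // unit_size
    return list(struct.unpack(">%d%s" % (count, fmt), data))

def describe(ch):
    """Return the UTF-8, UTF-16 and UTF-32 units for one character."""
    return {
        "utf-8": code_units(ch, "utf-8", 1),
        "utf-16": code_units(ch, "utf-16-be", 2),
        "utf-32": code_units(ch, "utf-32-be", 4),
    }

for ch in ["A", "\u20ac", "\U00010400"]:
    units = describe(ch)
    for name in ("utf-8", "utf-16", "utf-32"):
        print("U+%04X" % ord(ch), name, [hex(u) for u in units[name]])
```

    For U+10400 the UTF-16 result is two units (a surrogate pair)
    while UTF-32 is still a single unit equal to the UCS number
    itself. For anything in the 16-bit range the one UTF-16 unit is
    just the character's number.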

    That ought to be the definition but in practice or at
    least in common usage the terms all get muddied ):

    > I was using:
    > http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
    >
    > Some other links for the curious:
    > http://czyborra.com/utf/
    > http://www.tldp.org/HOWTO/Unicode-HOWTO-1.html
    > http://www.cl.cam.ac.uk/~mgk25/unicode.html

    Thanks for the pointers! Here are some I find
    useful:
    http://mail.nl.linux.org/linux-utf8/ (not just UTF-8)
    http://archive.develooper.com/perl6-internals-unicode@perl.org/
    (old but it covers many issues we
    are going to be grappling with)
    http://mail.gnome.org/archives/gtk-i18n-list/ (Pango
    discussion takes place here)

    > Andrew, thanks for the answer. Personally I see no
    > problem using UTF-8
    > internally, but I don't do piecetable work so it's
    > not really my call. I
    > wasn't trying to preempt the decision; the new class
    > is just a utility.

    Neither do I. I'm eager to hear what Dom has to say about
    the specifics on Monday...
    Frank, I know you weren't preempting anything but I'm
    glad it came up since now is the time it needs to come
    up (:

    Andrew Dunbar.

    > Frank
    >
    > Francis James Franklin
    > F.J.Franklin@shef.ac.uk
    >
    > "No, she really likes me. She told me I look like
    > Britney Spears, and why
    > would you say that to somebody you don't like?"
    >
    > --- Elle Woods
    >
    >

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com


    This archive was generated by hypermail 2.1.4 : Sun Apr 21 2002 - 10:00:17 EDT