Re: Fwd: Unicode Private User Area conflict

From: Anthony Fok (anthony@thizlinux.com)
Date: Sat Feb 16 2002 - 06:20:54 GMT


    Hello all,

    On Fri, Feb 15, 2002 at 11:42:02AM +0000, Andrew Dunbar wrote:
    > ----- Forwarded message from Fencol Yung -----
    > > I found that the following Unicode Private Use Area
    > > characters defined in
    > > AbiWord 0.9.6 conflict with our charset:
    > >
    > > UCS_FIELDSTART 0xe000
    > > UCS_FIELDEND 0xe001
    > > UCS_BOOKMARKSTART 0xe002
    > > UCS_BOOKMARKEND 0xe003
    > >
    > > I read in the mail archives that these characters were actually moved
    > > from elsewhere. If I move them to another position to resolve the
    > > conflict, will that have any drawbacks? And what is the purpose of
    > > those special characters?

    Yes, these four codepoints are at the very beginning of the Unicode
    PUA, and thus clash with at least 2 major charsets: GB18030 and
    BIG5-HKSCS. (GB18030: New simplified Chinese encoding standard that
    maps to Unicode one-to-one; BIG5-HKSCS: Extension to the traditional
    Chinese Big5 encoding, by the Hong Kong government.)

    As a matter of fact, we ran into the same problem about a
    month ago when our product was being certified for GB18030 compliance
    at the official Chinese Testing Agency. Three of the GB18030 test
    documents map to Unicode PUA U+E000 to U+E765. Loading the first of
    these 3 test documents would cause AbiWord to crash immediately.

    We had to move these {FIELD,BOOKMARK}{START,END} codes out of the way (to
    U+F000..U+F003 temporarily) in order to pass the certification.

    But I agree that even putting them in U+F000..U+F003 is problematic.
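
    A minimal sketch of why any in-band location stays fragile, written in
    Python rather than AbiWord's own code. The function name and the sample
    strings are hypothetical; only the two marker names and the temporary
    location U+F000/U+F001 come from this thread. A scanner that treats a
    reserved code point as a marker cannot tell it apart from imported text
    that legitimately uses the same code point:

```python
# Hypothetical in-band markers at the temporary location mentioned above.
UCS_FIELDSTART = "\uf000"
UCS_FIELDEND = "\uf001"


def count_field_markers(buffer):
    """Count the code points an in-band scanner would treat as field starts."""
    return buffer.count(UCS_FIELDSTART)


# Buffer built by the application itself: one genuine field.
internal = "Page " + UCS_FIELDSTART + "page_number" + UCS_FIELDEND

# Imported text whose charset maps a user-defined glyph to U+F000.
imported = "name: \uf000"

# Both buffers look the same to the scanner, although only the first
# one contains a real field.
print(count_field_markers(internal), count_field_markers(imported))
```

    Moving the markers only changes which imported documents trigger the
    confusion; it does not remove it.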

    > Yes I moved them. I think we only had the first two at that time.
    > They are used internally by AbiWord. And are assumed not to be
    > imported by any document. This is not a good assumption and we really
    > need to redesign this part of the code IMHO to use some kind of
    > out-of-band data instead of overriding the characters. Possibly we
    > can find some true "never to be used" characters but I doubt it. The
    > people who know these parts of the code (please grep for them) should
    > be able to discuss the whys, wherefores, and possible solutions in
    > this list. Before I moved them they conflicted (from memory) with
    > illegal and/or BOM codes, which messed with importers and
    > exporters and generally seemed like a bad idea.

    > Hope someone has a good idea to fix this. Merely moving them around
    > is probably going to keep breaking somebody's private stuff here and
    > there...

    I agree. Nevertheless, I think we do need to move them now until a
    better solution is found. The Unicode PUA spans U+E000..U+F8FF.
    The range U+E000..U+E765 is explicitly set as three User Defined
    Areas (UDAs) in the GB18030 standard, so we must stay out of this
    range, otherwise AbiWord would not comply with GB18030 (mandatory in
    Mainland China). (Yes, the mapping table goes higher, but no one uses
    anything that high yet, not even the GB18030 test documents. :-)

    U+E000..U+F848 maps to the EUDC (End-User Defined Characters) areas in
    the CP950 / BIG5 / BIG5-HKSCS standards as compatibility codepoints, so
    it would be best to stay out of this range too. There is no mapping
    from BIG5-HKSCS to U+F849..U+F8FF.

    So, for now, U+F849..U+F8FF is free. I suggest putting the four
    AbiWord internal control codes in U+F850..U+F853. Yes, this only
    treats the symptom, but it is important to have this fix _now_ for
    GB18030 and HKSCS compliance. This will do until the real cure comes.
    :-)
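
    The range reasoning above can be double-checked with a small sketch.
    It is Python, not part of the attached patch, and the helper names are
    hypothetical. The ranges are the ones quoted in this message, and the
    order in which the four codes are assigned within U+F850..U+F853 is an
    assumption:

```python
# Ranges as quoted in this message (inclusive bounds).
PUA = range(0xE000, 0xF8FF + 1)          # Unicode Private Use Area
GB18030_UDA = range(0xE000, 0xE765 + 1)  # three GB18030 User Defined Areas
BIG5_EUDC = range(0xE000, 0xF848 + 1)    # CP950 / BIG5 / BIG5-HKSCS EUDC

# Assumed assignment order within the proposed block U+F850..U+F853.
PROPOSED = {
    "UCS_FIELDSTART": 0xF850,
    "UCS_FIELDEND": 0xF851,
    "UCS_BOOKMARKSTART": 0xF852,
    "UCS_BOOKMARKEND": 0xF853,
}


def conflicts(cp):
    """Return the names of the charset ranges that contain code point cp."""
    hits = []
    if cp in GB18030_UDA:
        hits.append("GB18030 UDA")
    if cp in BIG5_EUDC:
        hits.append("BIG5 EUDC")
    return hits


def is_free_pua(cp):
    """True if cp lies in the PUA and outside both user-defined ranges."""
    return cp in PUA and not conflicts(cp)


for name, cp in PROPOSED.items():
    print(f"{name} U+{cp:04X} free={is_free_pua(cp)}")
```

    With these ranges, the original U+E000 hits both charsets and the
    temporary U+F000 still hits the BIG5 EUDC range, while the proposed
    block hits neither.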

    A patch is attached. Thanks! :-)

    Anthony

    -- 
    Anthony Fok Tung-Ling
    ThizLinux Laboratory   <anthony@thizlinux.com> http://www.thizlinux.com/
    Debian Chinese Project <foka@debian.org>       http://www.debian.org/intl/zh/
    Come visit Our Lady of Victory Camp!           http://www.olvc.ab.ca/
    



