Re: Fwd: Unicode Private User Area conflict

From: Martin Sevior (msevior@mccubbin.ph.unimelb.edu.au)
Date: Sat Feb 16 2002 - 06:52:23 GMT

    On Sat, 16 Feb 2002, Anthony Fok wrote:

    > Hello all,
    >
    > On Fri, Feb 15, 2002 at 11:42:02AM +0000, Andrew Dunbar wrote:
    > > ----- Forwarded message from Fencol Yung -----
    > > > I found that the following Unicode Private Use Area
    > > > characters defined in
    > > > AbiWord 0.9.6 conflict with our charset:
    > > >
    > > > UCS_FIELDSTART 0xe000
    > > > UCS_FIELDEND 0xe001
    > > > UCS_BOOKMARKSTART 0xe002
    > > > UCS_BOOKMARKEND 0xe003
    > > >
    > > > I read in the mail archives that these characters were actually moved
    > > > from elsewhere. If I move them to another position to resolve the
    > > > conflict, will that cause any drawbacks? And what is the purpose of
    > > > those special characters?
    >
    > Yes, these four codepoints are at the very beginning of the Unicode
    > PUA, and thus clash with at least 2 major charsets: GB18030 and
    > BIG5-HKSCS. (GB18030: New simplified Chinese encoding standard that
    > maps to Unicode one-to-one; BIG5-HKSCS: Extension to the traditional
    > Chinese Big5 encoding, by the Hong Kong government.)
    >
    > As a matter of fact, we ran into the same problem about a
    > month ago when our product was being certified for GB18030 compliance
    > at the official Chinese Testing Agency. Three of the GB18030 test
    > documents map to Unicode PUA U+E000 to U+E765. Loading the first of
    > these 3 test documents would cause AbiWord to crash immediately.
    >
    > We had to move these {FIELD,BOOKMARK}{START,END} out of the way (to
    > U+F000..U+F003 temporarily) in order to pass the certification.
    >
    > But I agree that even putting them in U+F000..U+F003 is problematic.

    We should move to a 32-bit internal representation so we can find codes
    that won't be touched.

    >
    > > Yes I moved them. I think we only had the first two at that time.
    > > They are used internally by AbiWord. And are assumed not to be
    > > imported by any document. This is not a good assumption and we really
    > > need to redesign this part of the code IMHO to use some kind of
    > > out-of-band data instead of overriding the characters. Possibly we
    > > can find some true "never to be used" characters but I doubt it. The
    > > people who know these parts of the code (please grep for them) should
    > > be able to discuss the whys, wherefores, and possible solutions in
    > > this list. Before I moved them, they conflicted (from memory) with
    > > illegal and/or BOM codes, which messed with importers and
    > > exporters and generally seemed like a bad idea.
    >
    > > Hope someone has a good idea to fix this. Merely moving them around
    > > is probably going to keep breaking somebody's private stuff here and
    > > there...
    >
    > I agree. Nevertheless, I think we do need to move them now until a
    > better solution is found. The Unicode PUA is in U+E000..U+F8FF.
    > The range U+E000..U+E765 is explicitly set as three User Defined
    > Areas (UDAs) in the GB18030 standard, so we must stay out of this
    > range, otherwise AbiWord would not comply with GB18030 (mandatory in
    > Mainland China). (Yes, the mapping table goes higher, but no one uses
    > anything that high yet, not even the GB18030 test documents. :-)
    >
    > U+E000..U+F848 maps to the EUDC (End-User Defined Characters) in the
    > CP950 / BIG5 / BIG5-HKSCS standard as compatibility codepoints, and it
    > would be best to stay out of these ranges too. There is no mapping
    > from BIG5-HKSCS to U+F849..U+F8FF.
    >
    > So, for now, U+F849..U+F8FF is free. I suggest putting the four
    > AbiWord internal control codes in U+F850..U+F853 for now. Yes, this
    > only treats the symptom, but it is important to have this fix _now_ for
    > GB18030 and HKSCS compliance. This will do until the real cure comes.
    > :-)
    >
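
To make the range arithmetic above easy to check, here is a minimal C++ sketch. It encodes the three ranges exactly as quoted in this thread and verifies at compile time that U+F850..U+F853 stays inside the Unicode PUA while avoiding both the GB18030 UDAs and the BIG5-HKSCS EUDC mapping. The `Range` type, the `k...` constants, and `isSafeInterimCodePoint` are hypothetical names for illustration, and the assignment order of the four constants is assumed to follow the original list; this is not the attached patch.

```cpp
#include <cstdint>

// Inclusive range of Unicode code points (hypothetical helper type).
struct Range {
    std::uint32_t first;
    std::uint32_t last;
    constexpr bool contains(std::uint32_t cp) const {
        return cp >= first && cp <= last;
    }
};

// Ranges as quoted in the thread.
constexpr Range kUnicodePua{0xE000, 0xF8FF};     // Unicode Private Use Area
constexpr Range kGb18030Uda{0xE000, 0xE765};     // GB18030 User Defined Areas
constexpr Range kBig5HkscsEudc{0xE000, 0xF848};  // CP950 / BIG5 / BIG5-HKSCS EUDC

// Proposed interim values; the order is an assumption based on the
// original 0xe000..0xe003 list.
constexpr std::uint32_t UCS_FIELDSTART    = 0xF850;
constexpr std::uint32_t UCS_FIELDEND      = 0xF851;
constexpr std::uint32_t UCS_BOOKMARKSTART = 0xF852;
constexpr std::uint32_t UCS_BOOKMARKEND   = 0xF853;

// True if cp is in the PUA but outside both charset-mapped ranges.
constexpr bool isSafeInterimCodePoint(std::uint32_t cp) {
    return kUnicodePua.contains(cp)
        && !kGb18030Uda.contains(cp)
        && !kBig5HkscsEudc.contains(cp);
}

// Compile-time checks: the new values pass, the old and temporary ones do not.
static_assert(isSafeInterimCodePoint(UCS_FIELDSTART), "U+F850 should be free");
static_assert(isSafeInterimCodePoint(UCS_BOOKMARKEND), "U+F853 should be free");
static_assert(!isSafeInterimCodePoint(0xE000), "U+E000 clashes with GB18030");
static_assert(!isSafeInterimCodePoint(0xF000), "U+F000 clashes with BIG5-HKSCS");
```

The check only covers the two charsets discussed here; any other user of U+F849..U+F8FF would still collide, which is why this remains a stopgap.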

    The real cure is going 32-bit internally. That will come after 1.0.
    Thanks very much for this workaround.
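
As a rough illustration of why 32 bits would help: Unicode code points end at U+10FFFF, so a 32-bit internal character type leaves values above that which no conforming document can contain. The sketch below is only an assumption about how such markers might look; `InternalChar`, the `k...` constants, and both helper functions are hypothetical names, not existing AbiWord code.

```cpp
#include <cstdint>

// Hypothetical 32-bit internal character type.
using InternalChar = std::uint32_t;

// Last code point defined by Unicode.
constexpr InternalChar kMaxUnicode = 0x10FFFF;

// Hypothetical internal markers placed just above the Unicode code space,
// where no charset mapping can reach them.
constexpr InternalChar kFieldStart    = 0x110000;
constexpr InternalChar kFieldEnd      = 0x110001;
constexpr InternalChar kBookmarkStart = 0x110002;
constexpr InternalChar kBookmarkEnd   = 0x110003;

// True for values that cannot be real Unicode characters.
constexpr bool isInternalMarker(InternalChar c) {
    return c > kMaxUnicode;
}

// An importer could accept only values within the Unicode code space, so
// document text can never be mistaken for an internal marker.
constexpr bool isImportable(InternalChar c) {
    return c <= kMaxUnicode;
}

static_assert(isInternalMarker(kFieldStart), "marker lies outside Unicode");
static_assert(isImportable(0xE000), "PUA characters stay usable by documents");
```

With this layout the whole PUA, including U+E000..U+F8FF, is left to documents, provided importers and exporters filter out anything above U+10FFFF.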

    > A patch is attached. Thanks! :-)
    >

    Even better :-)

    Martin


