From: Martin Sevior (msevior@mccubbin.ph.unimelb.edu.au)
Date: Sat Feb 16 2002 - 06:52:23 GMT
On Sat, 16 Feb 2002, Anthony Fok wrote:
> Hello all,
>
> On Fri, Feb 15, 2002 at 11:42:02AM +0000, Andrew Dunbar wrote:
> > ----- Forwarded message from Fencol Yung -----
> > > I found that the following Unicode Private Use Area characters
> > > defined in AbiWord 0.9.6 conflict with our charset:
> > >
> > > UCS_FIELDSTART 0xe000
> > > UCS_FIELDEND 0xe001
> > > UCS_BOOKMARKSTART 0xe002
> > > UCS_BOOKMARKEND 0xe003
> > >
> > > I read in the mail archives that these characters were actually moved
> > > from elsewhere. If I move them to another position to resolve the
> > > conflict, will that have any drawbacks? And what is the purpose of
> > > these special characters?
>
> Yes, these four codepoints are at the very beginning of the Unicode
> PUA, and thus clash with at least 2 major charsets: GB18030 and
> BIG5-HKSCS. (GB18030: New simplified Chinese encoding standard that
> maps to Unicode one-to-one; BIG5-HKSCS: Extension to the traditional
> Chinese Big5 encoding, by the Hong Kong government.)
>
> As a matter of fact, we ran into the same problem about a
> month ago when our product was being certified for GB18030 compliance
> at the official Chinese Testing Agency. Three of the GB18030 test
> documents map to Unicode PUA U+E000 to U+E765. Loading the first of
> these 3 test documents would cause AbiWord to crash immediately.
>
> We had to move these {FIELD,BOOKMARK}{START,END} out of the way (to
> U+F000..U+F003 temporarily) in order to pass the certification.
>
> But I agree that even putting them in U+F000..U+F003 is problematic.
We should move to a 32-bit internal representation so we can find codes that
won't be touched.
>
> > Yes I moved them. I think we only had the first two at that time.
> > They are used internally by AbiWord. And are assumed not to be
> > imported by any document. This is not a good assumption and we really
> > need to redesign this part of the code IMHO to use some kind of
> > out-of-band data instead of overriding the characters. Possibly we
> > can find some true "never to be used" characters but I doubt it. The
> > people who know these parts of the code (please grep for them) should
> > be able to discuss the whys, wherefores, and possible solutions in
> > this list. Before I moved them, they conflicted (from memory) with
> > illegal and/or BOM codes, which messed with importers and
> > exporters and generally seemed like a bad idea.
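
A rough illustration of the "out-of-band data" idea mentioned above, as a sketch only: this is not AbiWord's actual design, and `FragKind`, `Frag`, and `plainText` are hypothetical names invented for this example. The point is that field and bookmark boundaries are carried as typed fragments rather than as reserved characters, so imported text may contain any code point, including U+E000, without being mistaken for a marker.

```cpp
#include <string>
#include <vector>

// Hypothetical types for illustration; none of these names exist in AbiWord.
enum class FragKind { Text, FieldStart, FieldEnd, BookmarkStart, BookmarkEnd };

struct Frag {
    FragKind kind;
    std::u32string text;  // only meaningful when kind == FragKind::Text
};

// Concatenate the visible text. Markers live outside the character stream,
// so they contribute no characters and cannot clash with document content.
inline std::u32string plainText(const std::vector<Frag>& doc) {
    std::u32string out;
    for (const Frag& f : doc) {
        if (f.kind == FragKind::Text) {
            out += f.text;
        }
    }
    return out;
}
```

With this layout, a document whose text happens to contain U+E000 round-trips unchanged, because no code point is reserved for internal use.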
>
> > Hope someone has a good idea to fix this. Merely moving them around
> > is probably going to keep breaking somebody's private stuff here and
> > there...
>
> I agree. Nevertheless, I think we do need to move them now until a
> better solution is found. The Unicode PUA is in U+E000..U+F8FF.
> The range U+E000..U+E765 is explicitly set as three User Defined
> Areas (UDAs) in the GB18030 standard, so we must stay out of this
> range, otherwise AbiWord would not comply with GB18030 (mandatory in
> Mainland China). (Yes, the mapping table goes higher, but no one uses
> anything that high yet, not even the GB18030 test documents. :-)
>
> U+E000..U+F848 maps to the EUDC (End-User Defined Characters) in the
> CP950 / BIG5 / BIG5-HKSCS standard as compatibility codepoints, and it
> would be best to stay out of these ranges too. There is no mapping
> from BIG5-HKSCS to U+F849..U+F8FF.
>
> So, for now, U+F849..U+F8FF is free. I suggest putting the four
> AbiWord internal control codes in U+F850..U+F853 for now. Yes, this
> only solves the symptom, but it is important to have this fix _now_ for
> GB18030 and HKSCS compliance. This will do until the real cure comes.
> :-)
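
The range reasoning above can be checked mechanically. The following is a minimal sketch using only the ranges quoted in this message. The constant names other than the four `UCS_*` codes, and the helpers `inRange` and `isFreeOfKnownClashes`, are made up for illustration. The `UCS_*` constants here are stand-ins carrying the proposed interim values, not the definitions from the AbiWord source or the attached patch.

```cpp
#include <cstdint>

// Inclusive ranges as quoted in this thread.
constexpr std::uint32_t kPuaFirst   = 0xE000, kPuaLast   = 0xF8FF;  // Unicode PUA
constexpr std::uint32_t kGbUdaFirst = 0xE000, kGbUdaLast = 0xE765;  // GB18030 UDAs
constexpr std::uint32_t kBig5First  = 0xE000, kBig5Last  = 0xF848;  // Big5 / HKSCS EUDC

// Stand-ins for the four internal control codes at the proposed interim values.
constexpr std::uint32_t UCS_FIELDSTART    = 0xF850;
constexpr std::uint32_t UCS_FIELDEND      = 0xF851;
constexpr std::uint32_t UCS_BOOKMARKSTART = 0xF852;
constexpr std::uint32_t UCS_BOOKMARKEND   = 0xF853;

constexpr bool inRange(std::uint32_t c, std::uint32_t lo, std::uint32_t hi) {
    return c >= lo && c <= hi;
}

// Hypothetical helper: true if c is in the PUA but outside both ranges
// that the charsets discussed here are known to use.
constexpr bool isFreeOfKnownClashes(std::uint32_t c) {
    return inRange(c, kPuaFirst, kPuaLast)
        && !inRange(c, kGbUdaFirst, kGbUdaLast)
        && !inRange(c, kBig5First, kBig5Last);
}

// Compile-time checks: the proposed codes pass, the old ones do not.
static_assert(isFreeOfKnownClashes(UCS_FIELDSTART), "clash");
static_assert(isFreeOfKnownClashes(UCS_FIELDEND), "clash");
static_assert(isFreeOfKnownClashes(UCS_BOOKMARKSTART), "clash");
static_assert(isFreeOfKnownClashes(UCS_BOOKMARKEND), "clash");
static_assert(!isFreeOfKnownClashes(0xE000), "original location clashes");
static_assert(!isFreeOfKnownClashes(0xF000), "temporary location clashes");
```

This only covers the two charsets named in the thread; any other private use of U+F850..U+F853 would still collide.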
>
The real cure is a 32-bit internal representation. That will come after 1.0.
Thanks very much for this workaround.
> A patch is attached. Thanks! :-)
>
Even better :-)
Martin