Subject: Patch: UCS & text import
From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Thu Jun 21 2001 - 08:59:12 CDT
This patch fixes a couple of UCS-2 related problems.
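First, the internal document format uses a couple of special UCS-2
values that can cause problems. For the field start and end markers we
used U+FFFE and U+FFFD, but both can turn up in imported text: U+FFFD
is the replacement character (our own UCS_REPLACECHAR), and U+FFFE,
although not a legal character, is what a BOM looks like when it is
read with the wrong byte order. Importing any kind of file which
contains them confuses Abi and makes her crash. I've changed the special
values to U+E000 and U+E001 in the Unicode "Private Use Area". I think
this is the right thing.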
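To make the ranges involved concrete, here is a minimal standalone sketch of the kind of check the importer asserts on. It is not AbiWord code: UcsChar, kFieldStart, kFieldEnd and the helper functions are hypothetical names used only for illustration, and only the numeric ranges come from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for UT_UCSChar; the real type is declared in ut_types.h.
typedef std::uint16_t UcsChar;

// The new marker values introduced by this patch (names are illustrative).
const UcsChar kFieldStart = 0xE000;
const UcsChar kFieldEnd   = 0xE001;

// Surrogate code units (U+D800..U+DFFF) never stand alone in valid text.
inline bool isSurrogate(UcsChar c) { return c >= 0xD800 && c <= 0xDFFF; }

// BMP Private Use Area (U+E000..U+F8FF), where the markers now live.
inline bool isPrivateUse(UcsChar c) { return c >= 0xE000 && c <= 0xF8FF; }

// U+FFFE and U+FFFF, the two values the patch treats as illegal.
inline bool isIllegal(UcsChar c) { return c == 0xFFFE || c == 0xFFFF; }

// True if an imported character cannot be mistaken for an internal marker
// and is not one of the values the patch asserts against.
inline bool isSafeToImport(UcsChar c)
{
    return !isSurrogate(c) && !isPrivateUse(c) && !isIllegal(c);
}

// Debug-build check in the spirit of the UT_ASSERT calls in the patch.
inline void checkImported(UcsChar c) { assert(isSafeToImport(c)); }
```

With these ranges U+FFFD passes as ordinary text, while U+E000 and U+E001 are rejected on import because they are reserved for the field markers.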
Fixing this uncovered a problem importing text which starts with a BOM
(U+FEFF). A BOM at the start of the file is now skipped.
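The BOM handling boils down to a "first character" flag. The sketch below shows the idea in isolation, assuming UCS_BOM is U+FEFF; dropLeadingBom, UcsChar and kBom are hypothetical names, and the real importer works on a stream and a UT_GrowBuf rather than a std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint16_t UcsChar;   // hypothetical stand-in for UT_UCSChar
const UcsChar kBom = 0xFEFF;     // assumed value of UCS_BOM

// Copy the input, dropping U+FEFF only when it is the very first character.
// Anywhere else it is a zero width no-break space and is kept as text.
std::vector<UcsChar> dropLeadingBom(const std::vector<UcsChar>& in)
{
    std::vector<UcsChar> out;
    out.reserve(in.size());
    bool firstChar = true;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const UcsChar c = in[i];
        if (!(firstChar && c == kBom))
            out.push_back(c);
        firstChar = false;
    }
    // Sanity check: this function only ever drops characters.
    assert(out.size() <= in.size());
    return out;
}
```

The flag is cleared after every character, not only after a BOM, so a U+FEFF later in the file is never swallowed.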
Andrew Dunbar.
-- http://linguaphile.sourceforge.net
Index: src/af/util/xp/ut_types.h
===================================================================
RCS file: /cvsroot/abi/src/af/util/xp/ut_types.h,v
retrieving revision 1.54
diff -u -r1.54 ut_types.h
--- src/af/util/xp/ut_types.h 2001/06/18 15:11:05 1.54
+++ src/af/util/xp/ut_types.h 2001/06/21 13:34:31
@@ -133,8 +133,9 @@
 #define UCS_REPLACECHAR ((UT_UCSChar)0xFFFD)
 
 /* Note: the following are our interpretations, not Unicode's */
-#define UCS_FIELDSTART ((UT_UCSChar)0xFFFE)
-#define UCS_FIELDEND ((UT_UCSChar)0xFFFD)
+/* Note: use Unicode Private Use Area 0xE000 - 0xF8FF */
+#define UCS_FIELDSTART ((UT_UCSChar)0xE000)
+#define UCS_FIELDEND ((UT_UCSChar)0xE001)
 
 #if 1
 /* try to use the unicode values for special chars */
Index: src/wp/impexp/xp/ie_imp_Text.cpp
===================================================================
RCS file: /cvsroot/abi/src/wp/impexp/xp/ie_imp_Text.cpp,v
retrieving revision 1.31
diff -u -r1.31 ie_imp_Text.cpp
--- src/wp/impexp/xp/ie_imp_Text.cpp 2001/06/21 07:46:36 1.31
+++ src/wp/impexp/xp/ie_imp_Text.cpp 2001/06/21 13:36:27
@@ -108,6 +108,16 @@
 	}
 	while (!m_Mbtowc.mbtowc(wc,b));
 
+	// Watch for evil Unicode values!
+	// Surrogates
+	UT_ASSERT(!(wc >= 0xD800 && wc <= 0xDFFF));
+	// Private Use Area
+	UT_ASSERT(!((wc >= 0xDB80 && wc <= 0xDBFF)||(wc >= 0xE000 && wc <= 0xF8FF)));
+	// AbiWord control characters
+	UT_ASSERT(wc != UCS_FIELDSTART && wc != UCS_FIELDEND);
+	// Illegal characters
+	UT_ASSERT(wc != 0xFFFE && wc != 0xFFFF);
+
 	ucs = m_ucsLookAhead;
 	m_ucsLookAhead = wc;
 
@@ -611,6 +621,7 @@
 {
 	UT_ASSERT(pStream);
 
+	bool bFirstChar = true;
 	UT_GrowBuf gbBlock(1024);
 	UT_UCSChar c;
 
@@ -629,17 +640,23 @@
 			// we interpret either CRLF, CR, or LF as a paragraph break.
 			// we also accept U+2028 (line separator) and U+2029 (para separator)
 			// especially since these are recommended by Mac OS X.
-
+
 			// flush out what we have
 			if (gbBlock.getLength() > 0)
 				X_ReturnNoMemIfError(ins.insertSpan(gbBlock));
 			X_ReturnNoMemIfError(ins.insertBlock());
 			break;
 
+		case UCS_BOM:
+			// This is Byte Order Mark at the start of file, Zero Width No-Break Space elsewhere
+			if (bFirstChar)
+				break;
+
 		default:
 			X_ReturnNoMemIfError(gbBlock.append(&c,1));
 			break;
 		}
+		bFirstChar = false;
 	}
 
 	if (gbBlock.getLength() > 0)
This archive was generated by hypermail 2b25 : Thu Jun 21 2001 - 08:57:11 CDT