request for help from CJK hackers


Subject: request for help from CJK hackers
From: Vlad Harchev (hvv@hippo.ru)
Date: Wed Nov 08 2000 - 13:13:46 CST


 Hi guys,

 It seems AbiWord-0.7.12 will be released at the beginning of next week, so it
would be nice if all CJK issues were worked out.
 It seems that the major thing left to do to make AW seamlessly support CJK
languages is import of RTFs with CJK characters. Internally, cutting and
pasting is implemented as exporting a fragment of the document to RTF and then
reading the piece that was cut back in from RTF. So unless the RTF importer
works with CJK characters, AW won't be able to paste anything.
 Belcon Zhao <rainfall@yeah.net> has been working hard on this problem for a
week or more (on other problems too, of course, but thanks to Belcon this is
the only one left to solve). Contact Belcon for more information.
 So could you guys see what's wrong with it?

 I should say that the current code works fine with single-byte encodings
(even when the current encoding and the encoding used in the RTF file differ),
so I have no idea why it doesn't work for CJK.

 Here are my recommendations on how to investigate the problem:
* Type a few Chinese characters (you may surround them with "AbiWord" to
quickly identify them in the raw RTF).
* Save the document as RTF (not RTF for old apps). Check that it's really RTF
 (giving it an .rtf extension is not enough - the type should be selected from
 the popup).
* Try to import that RTF file. As I understand it, incorrect Chinese
characters are read.

 You can also just use cut and paste - the same set of exporter and importer
functions will get called.

 The function that should be inspected:
IE_Imp_RTF::ParseChar(UT_UCSChar ch,bool no_convert=0) in
/src/wp/impexp/xp/ie_imp_RTF.cpp

 The first parameter is the character that was read from the .rtf file. It is
either (1) a raw character, (2) a character given in the form \'hh (e.g.
"\'a3" results in a call to ParseChar(0xa3,0)), or (3) a Unicode value given
as \uc0\uHHHH (e.g. \uc0\u3e9f results in a call to ParseChar(0x3e9f,1)).
 The second parameter tells whether the character should be converted from
the charset of the RTF file or whether it's already a Unicode character (case
3 above - the \uc0\uHHHH form).
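
 To make the mapping from RTF text to ParseChar() calls concrete, here is a
small standalone sketch (not the AbiWord code). It handles only raw bytes and
the \'hh form, ignores control words entirely, and leaves out the \u form. The
names ParseCharCall and scanFragment are made up for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one ParseChar(ch, no_convert) call.
struct ParseCharCall {
    unsigned int ch;
    bool no_convert;
};

// Value of one hex digit, or -1 if c is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Walks a text fragment and lists the calls the importer would make for
// raw bytes and \'hh escapes. Both forms still need charset conversion,
// so no_convert is false for every entry.
std::vector<ParseCharCall> scanFragment(const std::string& rtf) {
    std::vector<ParseCharCall> calls;
    for (std::size_t i = 0; i < rtf.size(); ++i) {
        if (rtf[i] == '\\' && i + 3 < rtf.size() && rtf[i + 1] == '\'') {
            int hi = hexValue(rtf[i + 2]);
            int lo = hexValue(rtf[i + 3]);
            if (hi >= 0 && lo >= 0) {
                calls.push_back({static_cast<unsigned int>(hi * 16 + lo), false});
                i += 3;  // skip the quote and both hex digits
                continue;
            }
        }
        // Anything else is passed through as a raw byte.
        calls.push_back({static_cast<unsigned char>(rtf[i]), false});
    }
    return calls;
}
```

 Note that a two-byte GB2312 character such as \'d6\'d0 arrives as two
separate calls, so the converter behind ParseChar() has to remember the first
byte until the second one shows up.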

 The following is done inside that function (only the important part is
shown):

UT_Bool IE_Imp_RTF::ParseChar(UT_UCSChar ch,bool no_convert)
{
        /* ensure we are not in a chunk marked as "deleted" */
        /* [...] */
        if (no_convert==0 && ch<=0xff)
        {
                wchar_t wc;
                if (m_mbtowc.mbtowc(wc,(UT_Byte)ch))
                        return AddChar(wc);
        } else
                return AddChar(ch);
        /* [...] */
}

 Here AddChar() inserts a Unicode character into the document (it works OK).
 m_mbtowc is of type UT_Mbtowc, defined in /src/af/util/xp/ut_mbtowc.cpp - a
wrapper around iconv that converts characters from the multibyte encoding of
the RTF file (it's properly set up) to Unicode. Instances of this wrapper are
used in a lot of places (e.g. when converting input from the keyboard or
importing plain text) and they work OK there. So I don't know why it doesn't
work here.
The function 'int UT_Mbtowc::mbtowc(wchar_t &wc,char mb)' returns 1 if mb is
the last byte of an already-aggregated multibyte sequence (in this case it
stores the converted character in wc, which is passed by reference as the 1st
parameter).
 Belcon says that m_mbtowc.mbtowc(wc,(UT_Byte)ch) returns 1.
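
 For anyone not familiar with this kind of wrapper, here is a minimal sketch
of a byte-at-a-time converter on top of iconv. It is NOT the real UT_Mbtowc;
the class name ByteDecoder, the 16-byte buffer and the "UCS-4LE" output
charset are all my assumptions for the illustration, and it targets stateless
encodings only.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for UT_Mbtowc: collects bytes until iconv can turn
// them into exactly one character.
class ByteDecoder {
public:
    explicit ByteDecoder(const char* from_charset)
        : cd_(iconv_open("UCS-4LE", from_charset)), len_(0) {}
    ~ByteDecoder() { if (valid()) iconv_close(cd_); }
    ByteDecoder(const ByteDecoder&) = delete;
    ByteDecoder& operator=(const ByteDecoder&) = delete;

    // False if iconv_open() did not recognise the charset name.
    bool valid() const { return cd_ != (iconv_t)-1; }

    // Returns true and stores the code point in `out` when `byte`
    // completes a character; returns false otherwise.
    bool feed(unsigned char byte, std::uint32_t& out) {
        if (!valid()) return false;
        if (len_ == sizeof buf_) len_ = 0;  // overlong sequence: start over
        buf_[len_++] = static_cast<char>(byte);

        char* src = buf_;
        std::size_t src_left = len_;
        unsigned char ucs4[4];
        char* dst = reinterpret_cast<char*>(ucs4);
        std::size_t dst_left = sizeof ucs4;

        std::size_t rc = iconv(cd_, &src, &src_left, &dst, &dst_left);
        if (rc == (std::size_t)-1) {
            // EINVAL means "incomplete so far": keep the bytes and wait.
            // Any other error means the sequence is bad: drop it.
            if (errno != EINVAL) len_ = 0;
            iconv(cd_, nullptr, nullptr, nullptr, nullptr);  // reset state
            return false;
        }
        len_ = 0;
        if (dst_left != 0) return false;  // no full character was produced
        out = ucs4[0] | (ucs4[1] << 8) | (ucs4[2] << 16) |
              (static_cast<std::uint32_t>(ucs4[3]) << 24);
        return true;
    }

private:
    iconv_t cd_;
    char buf_[16];
    std::size_t len_;
};
```

 With ByteDecoder d("GB2312"), feeding 0xd6 and then 0xd0 should report one
character on the second call, provided the local iconv knows that charset
name. If it does not, valid() is false and nothing is ever converted.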

 The most probable reason why it doesn't work is that the iconv_t member of
m_mbtowc is ((iconv_t)-1), i.e. iconv_open() failed. Could you check that?
 The input charset for m_mbtowc is set twice - once at creation of IE_Imp_RTF
(it's set to the current locale's charset) and a second time when \ansicpg is
seen, in IE_Imp_RTF::TranslateKeyword:
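
 A quick way to check whether a given charset name is accepted by your iconv
is a helper like the one below. This is only a sketch; the function name
iconvKnowsCharset is made up, and "UTF-8" as the target is an arbitrary
choice.

```cpp
#include <iconv.h>

// Returns true if iconv_open() accepts `name` as a source charset.
// iconv_open() signals failure by returning (iconv_t)-1.
bool iconvKnowsCharset(const char* name) {
    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return false;
    iconv_close(cd);
    return true;
}
```

 Calling it with the exact string that is passed to setInCharset() shows
whether the descriptor inside m_mbtowc can be valid at all.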
        switch (*pKeyword)
        {
        case 'a':
                if (strcmp((char*)pKeyword, "ansicpg") == 0)
                {
                        m_mbtowc.setInCharset(XAP_EncodingManager::instance->
                                charsetFromCodepage((UT_uint32)param));
                }
                break;
                /* [...] */
        }
 So you should ensure that XAP_EncodingManager::instance->
charsetFromCodepage((UT_uint32)param) returns a charset name that libc knows.
If it returns a charset name unknown to glibc, just tell me what it should
return for which parameter (and what it actually returns) and I will correct
it properly (or do it yourself - in /src/af/xap/xp/xap_EncodingManager.cpp).
In the meantime, write a quick hack so that you don't have to wait for the
correct fix (from you or me), and test it. Test cut and paste after this.
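
 As an example of such a quick hack, the lookup could be as simple as the
sketch below. The function name charsetForCodepage is hypothetical (it is not
the real charsetFromCodepage()), and the names it returns are only the ones I
would expect an iconv to accept for these Windows code pages; verify each of
them against your own iconv before relying on it.

```cpp
#include <cstdint>
#include <string>

// Hypothetical quick-hack mapping from an \ansicpg value to an iconv
// charset name. The returned names are assumptions to be verified locally.
std::string charsetForCodepage(std::uint32_t codepage) {
    switch (codepage) {
    case 932: return "CP932";  // Japanese (Shift_JIS family)
    case 936: return "GBK";    // Simplified Chinese
    case 949: return "CP949";  // Korean
    case 950: return "BIG5";   // Traditional Chinese
    default:
        // Fall back to the generic "CPnnnn" spelling.
        return "CP" + std::to_string(codepage);
    }
}
```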

Also, please test (and fix :) the following:
* cutting from AW and pasting to other apps
* pasting to AW from other apps

 It seems other things are OK.
 But if you want to polish things, add a correct header to be written by AW
when exporting to LaTeX (see the function XAP_EncodingManager::getTexPrologue()
and the way TexPrologue is initialized in XAP_EncodingManager::initialize()).

 You can also check (and fix) that Word and other apps understand the RTF
generated by AW and that AW understands their RTF.

 Please report any problems.
 Feel free to contact me directly if you have any trouble.

 PS: Latest news: Belcon says that with the patch to xap_UnixFont.cpp that
was committed last night, AW shows characters in GB2312 without any problem.

 When testing, remember that your $LANG should contain the name of the
encoding (as understood by your iconv implementation) - e.g. "zh_CN.GB2312"
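
 If you are not sure what your $LANG contains, the encoding part can be
pulled out as in this small sketch. The helper name encodingFromLocale is made
up, and it assumes the usual "language_TERRITORY.encoding@modifier" layout.

```cpp
#include <cstddef>
#include <string>

// Returns the encoding part of a locale name such as "zh_CN.GB2312",
// or an empty string if the name carries no encoding.
std::string encodingFromLocale(const std::string& locale) {
    std::size_t dot = locale.find('.');
    if (dot == std::string::npos)
        return "";
    // An optional "@modifier" may follow the encoding; cut it off.
    std::size_t at = locale.find('@', dot);
    std::size_t count =
        (at == std::string::npos) ? std::string::npos : at - dot - 1;
    return locale.substr(dot + 1, count);
}
```

 An empty result for your own $LANG means the locale names no encoding at
all, which is worth fixing before you test anything else.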

 Let's make AW CJK-aware!

 Thanks for your help in advance.

 Best regards,
  -Vlad


