Re: Error report of AW's CJK version and my ugly patch!


Subject: Re: Error report of AW's CJK version and my ugly patch!
From: Vlad Harchev (hvv@hippo.ru)
Date: Fri Nov 03 2000 - 01:15:41 CST


On Fri, 3 Nov 2000, Belcon wrote:

 Hello Belcon,

> After I apply Vlad's i18n patch(whose name is abi-oct24.patch),cjk.patch
> and next-cjk-patch.diff
> on clean AbiWord 0.7.11,I still found that AbiWord can't work with RTF
> format file normally.
> And so,we can't copy&paste CJK characters in AW,read and show CJK
> characters correctly.

 What do you mean by 'read & show'?

 Have you tried my most recent patch, which uses gdk_fontset_load() instead
of gdk_font_load() in gr_UnixGraphics.cpp?

> Here is a patch to fixed this problem.I only test in Linux.This patch
> does such work:
>
> *** Fixed AW's problem with CJK RTF format file and copy&paste CJK
> characters.
> *** A quick fix to distinguish "zh_CN" from "zh_TW" on the
> "WinLanguageCode"
> (This part still needs work cause my patch on this part is really
> ugly)

 That is a rather hackish solution; a more generic one should be used in the
final code: just add a map from lang_Territory.charset to the Windows
language code and consult it after looking in langinfo. If an entry is found,
use it; otherwise use the value from langinfo. But don't bother coding all
this if you don't want to (I can code it up in a minute); use your quick fix
for now.
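
 A minimal sketch of such a map, assuming the key is the full locale string
and the value is the Windows language code. The names kWinLangOverrides and
winLanguageCode are hypothetical and not part of AbiWord; 0x0804 and 0x0404
are the usual Windows language IDs for Chinese (PRC) and Chinese (Taiwan).

```cpp
#include <map>
#include <string>

// Hypothetical override table: "lang_Territory.charset" -> Windows language code.
static const std::map<std::string, unsigned> kWinLangOverrides = {
    {"zh_CN.GB2312", 0x0804},  // Chinese (PRC)
    {"zh_TW.Big5",   0x0404},  // Chinese (Taiwan)
};

// langinfoValue is whatever the existing langinfo lookup produced.
// If the locale has an override, use it; otherwise keep the langinfo value.
unsigned winLanguageCode(const std::string& locale, unsigned langinfoValue) {
    std::map<std::string, unsigned>::const_iterator it = kWinLangOverrides.find(locale);
    return it != kWinLangOverrides.end() ? it->second : langinfoValue;
}
```

 With this, winLanguageCode("zh_TW.Big5", v) yields 0x0404 regardless of v,
while a locale that is not in the table simply gets v back.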

> *** Correct the {CP950,"GB2312"} {CP936,"Big5"} to {CP950,"Big5"}
> {CP936,"GB2312"}.

 My most recent patch, which I announced here (and cc'd to you), already had
these lines.
 
> BTW: It seemed that current AW CVS version hadn't apply all CJK patch
> of Vlad.After I install it,it even can't show CJK characters in
> menu and toolbar.:-(

 Wow! I didn't expect this.
 I will check it in about 8 hours.

> Best regards,
> -Belcon

 Here are comments on your fix for ie_imp_RTF.cpp:

 The vital part of your patch is the following:
- ok = ParseChar(b,0);
+ if(m.mbtowc(wc,(char)b))
+ ok = ParseChar((UT_UCSChar)wc,0);

 The function IE_Imp_RTF::ParseChar(UT_UCSChar ch,bool no_convert)
 is supposed to do exactly the same. It has the following fragment:

 if ((ch >= 32 || ch == 9 || ch == UCS_FF || ch == UCS_LF) &&
!m_currentRTFState.m_charProps.m_deleted)
             {
                     if (no_convert==0 && ch<=0xff)
                     {
                             wchar_t wc;
                             if (m_mbtowc.mbtowc(wc,(UT_Byte)ch))
                                     return AddChar(wc);
                     } else
                             return AddChar(ch);
             }

 That should do exactly the same. Since the original code
 "ok = ParseChar(b,0);" was calling ParseChar with 0 as the 2nd arg, the
 following piece of code:

                             wchar_t wc;
                             if (m_mbtowc.mbtowc(wc,(UT_Byte)ch))
                                     return AddChar(wc);
 should be executed.
 While writing this, I noticed a bug: that fragment should be

                             wchar_t wc;
                             if (m_mbtowc.mbtowc(wc,(UT_Byte)ch))
                                     return AddChar(wc);
                             else
                                     return UT_TRUE;
 So please try adding "else return UT_TRUE;" there and report the results
 with only this modification (without the fix you've proposed).
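
 To illustrate why the extra "else" could matter, here is a toy sketch. It
 assumes the converter returns false when it has consumed a byte but has no
 complete character yet (e.g. the lead byte of a two-byte character), which
 is not an error. ToyDecoder and ToyParser are made-up stand-ins, not
 UT_Mbtowc or IE_Imp_RTF, and the "encoding" is deliberately simplistic.

```cpp
#include <cstdint>

// Toy stateful decoder: bytes >= 0x81 start a two-byte character.
struct ToyDecoder {
    int lead = -1;  // pending lead byte, or -1 if none

    // Returns true and sets 'out' when a full character is available,
    // false when the byte was only stored as a lead byte.
    bool feed(unsigned char b, std::uint32_t& out) {
        if (lead >= 0) {
            out = (static_cast<std::uint32_t>(lead) << 8) | b;
            lead = -1;
            return true;
        }
        if (b >= 0x81) {
            lead = b;
            return false;
        }
        out = b;
        return true;
    }
};

struct ToyParser {
    ToyDecoder dec;
    int emitted = 0;  // number of characters added so far

    bool parseChar(unsigned char b) {
        std::uint32_t wc = 0;
        if (dec.feed(b, wc)) {
            ++emitted;   // stands in for AddChar(wc)
            return true;
        }
        return true;     // lead byte consumed, nothing to add: still success
    }
};
```

 Feeding the bytes 0xB0 and 0xA1 gives two successful calls but only one
 emitted character; without the explicit success return, the first call would
 fall through to whatever the rest of the function returns.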

 Your way of fixing this problem is unacceptable because you use a UT_Mbtowc
 that converts only from the native charset to UCS. That's incorrect, since
 under some conditions the charset of the characters in the RTF file differs
 from the native charset (this is almost always the case for Russian text).
 When the RTF parser finds \ansicpg, it uses the ANSI code page number to
 initialize the source charset of the IE_Imp_RTF::m_mbtowc converter via
 UT_Mbtowc::setInCharset(charsetname).
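
 As a rough sketch of that step, assuming the charset name is simply "CP"
 followed by the number after \ansicpg (the real importer may use its own
 table), one could extract the code page like this. findAnsiCodePage and
 charsetNameForCodePage are hypothetical helpers, not AbiWord functions.

```cpp
#include <cctype>
#include <string>

// Returns the number following "\ansicpg" in the RTF text,
// or -1 if the keyword is missing or has no digits after it.
int findAnsiCodePage(const std::string& rtf) {
    const std::string key = "\\ansicpg";
    std::string::size_type pos = rtf.find(key);
    if (pos == std::string::npos)
        return -1;
    pos += key.size();
    int cp = 0;
    bool anyDigit = false;
    while (pos < rtf.size() &&
           std::isdigit(static_cast<unsigned char>(rtf[pos]))) {
        cp = cp * 10 + (rtf[pos] - '0');
        ++pos;
        anyDigit = true;
    }
    return anyDigit ? cp : -1;
}

// Assumed naming scheme: code page 1251 -> "CP1251".
std::string charsetNameForCodePage(int cp) {
    return "CP" + std::to_string(cp);
}
```

 The resulting name is what would then be handed to something like
 setInCharset(), so that bytes are decoded with the file's charset rather
 than the native one.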

 Please try debugging the reason why all this doesn't work. Before importing
 an RTF file, set a breakpoint on UT_Mbtowc::setInCharset and check whether
 it correctly opens the conversion descriptor. Then set a breakpoint on
 UT_Mbtowc::mbtowc to see that it gets the correct characters from the RTF
 file and converts them to the right UCS-2 values.
 I should say that all this works fine for Russian documents. They don't use
 multibyte strings, but importing RTF documents wouldn't work at all if there
 were a bug in this code.

 Please report results or problems.

 Best regards,
  -Vlad


