Re: request for help from CJK hackers


Subject: Re: request for help from CJK hackers
From: Belcon Zhao (belcon@hotmail.com)
Date: Wed Nov 08 2000 - 19:51:30 CST


Hello Vlad

>From: Vlad Harchev <hvv@hippo.ru>
>To: abiword-dev@abisource.com
>CC: Belcon Zhao <belcon@hotmail.com>, Belcon <rainfall@yeah.net>,
>hashao <hashao@china.com>, Chih-Wei Huang <cwhuang@linux.org.tw>, hj
><huangj@citiz.net>
>Subject: request for help from CJK hackers
>Date: Wed, 8 Nov 2000 23:13:46 +0400 (SAMT)
>
> Hi guys,
>
> It seems AbiWord-0.7.12 will be released at the beginning of next week, so
>it would be nice if all CJK issues were worked out.
> It seems that the major thing left to do to make AW seamlessly support
>CJK languages is importing RTF files that contain CJK characters. Internally,
>cutting and pasting is implemented by exporting a fragment of the document
>to RTF and then reading the cut piece back from RTF. So unless the RTF
>importer works with CJK characters, AW won't be able to paste anything.
> Belcon Zhao <rainfall@yeah.net> has been working hard on this problem for a
>week or more (and on other problems too, of course, but thanks to Belcon this
>is the only problem left to solve). Contact Belcon for more information.
> So could you guys see what's wrong with it?
>
> I should say that the current code works fine with single-byte encodings
>(even when the current encoding and the encoding used in the RTF file
>differ). So I have no idea why it doesn't work for CJK.
>
> Here are my recommendations on how to investigate the problem:
>* Type a few Chinese characters (you may surround them with "AbiWord" to
> quickly identify them in the raw RTF).
>* Save the document as RTF (not RTF for old apps). Check that it's really RTF
> (giving it an .rtf extension is not enough - the type should be selected
> from the popup).
>* Try to import that RTF file. As I understand it, incorrect Chinese
> characters are read.
>
> You can use just cut and paste - the same set of exporter and importer
>functions will get called.
>
> The function that should be inspected:
>IE_Imp_RTF::ParseChar(UT_UCSChar ch,bool no_convert=0) in
>/src/wp/impexp/xp/ie_imp_RTF.cpp
>
> The first parameter is the character that was read from the .rtf file:
>either raw, or converted from the form \'hh (e.g. "\'a3" results in a call
>to ParseChar(0xa3,0)), or given as a Unicode value in the form \uc0\uHHHH
>(e.g. \uc0\u3e9f results in a call to ParseChar(0x3e9f,1)).
> The second parameter tells whether the character should be converted from
>the charset of the RTF file or whether it's already a Unicode character
>(case 3 above - the \uc0\uHHHH form).
>
> The following is done inside that function (only the important part is left):
>
>UT_Bool IE_Imp_RTF::ParseChar(UT_UCSChar ch,bool no_convert)
>{
>    /* ensure we are not in a chunk marked as "deleted" */
>    if (no_convert==0 && ch<=0xff)
>    {
>        wchar_t wc;
>        if (m_mbtowc.mbtowc(wc,(UT_Byte)ch))
>            return AddChar(wc);
>    } else
>        return AddChar(ch);
>}
>
> Here AddChar() inserts a Unicode character into the document (it works OK).
> m_mbtowc is of type UT_Mbtowc, defined in /src/af/util/xp/ut_mbtowc.cpp
>- a wrapper around iconv that converts characters from the multibyte
>encoding of the RTF file (it's properly set up) to Unicode. Instances of this
>wrapper are used in a lot of places (e.g. when converting input from the
>keyboard or importing plain text) and they work OK there. So I don't know
>why it doesn't work here.
>The function 'int UT_Mbtowc::mbtowc(wchar_t &wc,char mb)' returns 1 if mb
>is the terminator of an already-aggregated multibyte sequence (in this case
>it returns the converted character in the variable passed by reference as
>the 1st parameter).
> Belcon says that m_mbtowc.mbtowc(wc,(UT_Byte)ch) returns 1.
>
> The most probable reason why it doesn't work is that the iconv_t member of
>m_mbtowc is ((iconv_t)-1). Could you check that?
> The input charset for m_mbtowc is set twice - once at creation of
>IE_Imp_RTF (it's set to the current locale's charset) and the second time
>when \ansicpg is seen, in IE_Imp_RTF::TranslateKeyword:
> switch (*pKeyword)
> {
> case 'a':
> if (strcmp((char*)pKeyword, "ansicpg") == 0)
> {
> m_mbtowc.setInCharset( XAP_EncodingManager::instance->
> charsetFromCodepage((UT_uint32)param));
> }
> break;
> /* [...] */
> }
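
The byte-at-a-time contract described above (each \'hh escape yields one byte, and the converter reports 1 only when that byte completes a character) can be illustrated with a minimal sketch. This is not the AbiWord code: ByteAccumulator, hexValue and decodeRtfText are hypothetical names, the accumulator assumes an EUC-style double-byte charset such as GB2312 (lead byte 0xA1-0xFE followed by one trail byte), and it only packs the two bytes together instead of converting them to Unicode through iconv as UT_Mbtowc does.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the contract of UT_Mbtowc::mbtowc:
// push one byte at a time; returns 1 when the byte completes a character.
// Assumption: EUC-style double-byte charset, lead byte in 0xA1..0xFE.
// No real charset conversion happens here; the pair is just packed.
class ByteAccumulator {
public:
    int push(std::uint32_t &out, unsigned char b) {
        if (havePending_) {
            // Second byte of a pair: emit lead and trail packed together.
            out = (static_cast<std::uint32_t>(pending_) << 8) | b;
            havePending_ = false;
            return 1;
        }
        if (b >= 0xA1 && b <= 0xFE) {
            // Lead byte: remember it and wait for the trail byte.
            pending_ = b;
            havePending_ = true;
            return 0;
        }
        // Plain single-byte character.
        out = b;
        return 1;
    }

private:
    unsigned char pending_ = 0;
    bool havePending_ = false;
};

// Returns the value of a hex digit, or -1 if c is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes RTF body text containing \'hh escapes into completed units.
// Other RTF control words are deliberately not handled in this sketch.
std::vector<std::uint32_t> decodeRtfText(const std::string &s) {
    std::vector<std::uint32_t> units;
    ByteAccumulator acc;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (s[i] == '\\' && i + 3 < s.size() && s[i + 1] == '\'') {
            int hi = hexValue(s[i + 2]);
            int lo = hexValue(s[i + 3]);
            if (hi >= 0 && lo >= 0) {
                b = static_cast<unsigned char>(hi * 16 + lo);
                i += 3;  // skip the rest of the \'hh escape
            }
        }
        std::uint32_t unit = 0;
        if (acc.push(unit, b)) units.push_back(unit);
    }
    return units;
}
```

With the test text suggested above, decodeRtfText("AbiWord\\'d6\\'d0AbiWord") yields the seven letters, one packed unit 0xD6D0 (the GB2312 bytes for one Chinese character), and the seven letters again. If the converter were instead set up for a single-byte charset, the two escaped bytes would come out as two separate wrong characters, which matches the symptom under discussion.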

Yeah, I got it. Here is the reason why m_mbtowc always returns 1, Vlad.
I added a debug message and found that param here is just 936 for
GB2312. We read "\ansicpg936" (for GB2312) from the RTF file, split
936 off the string, and set param=936. But that is not the value the
code expects, IMHO. Vlad, I guess you want to get param=0x804 (for GB2312)
or 0x404 (for Big5), so that ***XAP_EncodingManager::instance->
charsetFromCodepage((UT_uint32)param)*** returns GB2312 or Big5. But if
param=936, it will ***always*** return CP1252. So our characters are
converted from the wrong charset.
Vlad, I am just curious how your Russian characters work fine. :-)
As far as I know, it should return CP1251 for Russian.
Here is my quick fix for GB2312 & Big5:

if (strcmp((char*)pKeyword, "ansicpg") == 0)
{
        if(param==950)
                m_mbtowc.setInCharset(XAP_EncodingManager::instance->
                        charsetFromCodepage((UT_uint32)0x404));
        else if(param==936)
                m_mbtowc.setInCharset(XAP_EncodingManager::instance->
                        charsetFromCodepage((UT_uint32)0x804));
        else
                m_mbtowc.setInCharset(XAP_EncodingManager::instance->
                        charsetFromCodepage((UT_uint32)param));
}
Of course, this still needs work. I am not familiar with your class, Vlad,
so I can't convert a codepage (CP936 or CP950) to a charset properly.
There is also still the problem I reported to you yesterday: the
ordering of English characters and Chinese characters. I am debugging it
now. If I get a result, I will report it to you.
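
One way to avoid hard-coding a branch per codepage would be a small lookup table from the \ansicpg number to a charset name. The following is only a sketch under assumptions: CodepageEntry, kCodepages and charsetForAnsiCodepage are hypothetical names and not part of XAP_EncodingManager, the charset strings are assumed to be names the local iconv accepts, and the "CP<number>" fallback is a guess at a reasonable default rather than what charsetFromCodepage() does.

```cpp
#include <cstdint>
#include <string>

// Hypothetical table entry: an RTF \ansicpg value and the charset name
// to hand to the converter. Not the real XAP_EncodingManager data.
struct CodepageEntry {
    std::uint32_t codepage;
    const char *charset;
};

// Only the codepages mentioned in this thread are listed.
static const CodepageEntry kCodepages[] = {
    {936, "GB2312"},   // Simplified Chinese
    {950, "BIG5"},     // Traditional Chinese
    {1251, "CP1251"},  // Russian
    {1252, "CP1252"},  // Western European
};

// Returns a charset name for the number that follows \ansicpg.
// Assumption: unknown codepages fall back to "CP<number>".
std::string charsetForAnsiCodepage(std::uint32_t codepage) {
    for (const CodepageEntry &entry : kCodepages) {
        if (entry.codepage == codepage) {
            return entry.charset;
        }
    }
    return "CP" + std::to_string(codepage);
}
```

With something like this, the \ansicpg handler could pass charsetForAnsiCodepage(param) to setInCharset() directly, and supporting another codepage would mean adding one table row instead of another else-if.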

> So you should ensure that XAP_EncodingManager::instance->
>charsetFromCodepage((UT_uint32)param) returns the name of a charset libc
>knows. If it returns a charset name unknown to glibc, just tell me what it
>should return for which parameter (and what it actually returns) and I will
>correct it properly (or do it yourself - in
>/src/af/xap/xp/xap_EncodingManager.cpp) - but write a quick hack so as not
>to wait for the correct fix (from you or me), and test it. Test cut and
>paste after this.
>
>Also, please test (and fix :) the following:
>* cutting from AW and pasting to other apps
>* pasting to AW from other apps
>
> It seems other things are OK.
> But if you want to polish, add a correct header to be written by AW
>when exporting to LaTeX (function XAP_EncodingManager::getTexPrologue() and
>the way TexPrologue is initialized in XAP_EncodingManager::initialize()).
>
> You can also check (and fix) that Word and other apps understand RTF
>generated by AW and that AW understands their RTF.
>
> Please report any problems.
> Feel free to contact me directly if you have troubles.
>
> PS: Latest news: Belcon says that with the patch to xap_UnixFont.cpp
>that was committed last night, AW shows characters in GB2312 without any
>problem.
>
> When testing, remember that your $LANG should contain the name of the
>encoding (as understood by your iconv implementation) - e.g. "zh_CN.GB2312"
>
> Let's make AW CJK-aware!
>
> Thanks for your help in advance.
>
> Best regards,
> -Vlad
>
BTW: CJK fonts are not Type1 fonts; they are TrueType fonts. So the CJK
version of AW depends on which TrueType fonts you have installed on your
system.
  Should we add a function to install TrueType fonts in AW automatically? Just
my opinion.

Best regards!
-Belcon



This archive was generated by hypermail 2b25 : Wed Nov 08 2000 - 19:51:35 CST