Re: MSWord DOC (fwd)


Subject: Re: MSWord DOC (fwd)
From: Vlad Harchev (hvv@hippo.ru)
Date: Sat Nov 11 2000 - 14:26:24 CST


On Sat, 11 Nov 2000, Dom Lachowicz wrote:

> I get "??word??" in AbiWord and I think that this is because I don't have
> any Chinese fonts or locales installed. In HTML, I get "ê¤ê¥wordê©ê¥" which
> looks Chinese enough for me :-)
>
> This is weird. I can import Chinese Word documents now just fine, but not US
> English ones... ugh.. need to fix this bug.

 I described how to test this: import the .doc file, save it as .abw, and
inspect the raw .abw file. The &#xHHHH values there are the character values
exactly as they were imported. In the case of CJK text in the word6 format,
they are not converted to Unicode.
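
To make the check above less manual, here is a minimal sketch that pulls the numeric values out of the character references in a saved .abw file's text. It assumes the references are written in the standard XML form "&#xHHHH;" (with the trailing semicolon); the function name extractHexCharRefs is hypothetical and not part of AbiWord or wv.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not AbiWord or wv code): collect the numeric value of
// every hexadecimal character reference ("&#xHHHH;") in `text`, in order.
std::vector<unsigned long> extractHexCharRefs(const std::string& text)
{
    std::vector<unsigned long> values;
    std::size_t pos = 0;
    while ((pos = text.find("&#x", pos)) != std::string::npos) {
        std::size_t i = pos + 3;
        unsigned long value = 0;
        std::size_t digits = 0;
        // Read at most 8 hex digits so the value cannot overflow 32 bits.
        while (i < text.size() && digits < 8 &&
               std::isxdigit(static_cast<unsigned char>(text[i]))) {
            const unsigned char c = static_cast<unsigned char>(text[i]);
            const unsigned digit = std::isdigit(c)
                ? static_cast<unsigned>(c - '0')
                : static_cast<unsigned>(std::tolower(c) - 'a' + 10);
            value = value * 16 + digit;
            ++i;
            ++digits;
        }
        // Accept only well-formed references: some digits, then ';'.
        if (digits > 0 && i < text.size() && text[i] == ';')
            values.push_back(value);
        pos = i;  // continue scanning after what was consumed
    }
    return values;
}
```

If the import worked, the values returned for the CJK part of the text should be Unicode code points; if they look like raw double-byte codes, the conversion was skipped.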

 I believe that the following piece of code is at fault:

ie_imp_MsWord_97.cpp:
int CharProc(wvParseStruct *ps,U16 eachchar,U8 chartype,U16 lid)
{
        [...]
           if (chartype)
             eachchar = wvHandleCodePage(eachchar, lid);
}

For that document, "chartype" is zero for some reason, so wvHandleCodePage()
is never called. So maybe, if fib.farEast is set, chartype should always be
treated as '1' (or maybe only for old word6 format documents, since a word8
format .doc with CJK text was imported just fine)?
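
The two variants of that suggestion can be sketched as a small standalone function. This is only an illustration of the decision, not a patch: FibSketch, its isWord6 flag, and effectiveCharType are hypothetical stand-ins, and the real wv structures and field names may differ.

```cpp
// Hypothetical stand-in for the relevant bits of the file information block;
// the real wv types and field names may differ.
struct FibSketch {
    bool farEast;  // the "fib.farEast" flag mentioned above
    bool isWord6;  // assumption: some way to tell a word6 document apart
};

// Decide which chartype value to use for the code page conversion check.
// onlyForWord6 selects the narrower variant of the suggestion.
unsigned char effectiveCharType(const FibSketch& fib,
                                unsigned char chartype,
                                bool onlyForWord6)
{
    if (chartype)
        return chartype;  // already non-zero: leave it alone
    if (!fib.farEast)
        return 0;         // not a far-east document: keep current behaviour
    if (onlyForWord6 && !fib.isWord6)
        return 0;         // narrower variant: word8 already imports fine
    return 1;             // force the code page conversion
}
```

In CharProc the result would then be used in place of the raw chartype in the "if (chartype)" test. Whether the narrow or the broad variant is right is exactly the open question.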

BTW, once that is fixed, wv will try to open iconv with the input charset
"cp950", which is unknown to glibc, so this should also be fixed ("big5" is
the alias for cp950).
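
A minimal sketch of such a name mapping, applied before the charset name is handed to iconv. It treats "cp950" and "big5" as equivalent, as described above; the function name glibcCharsetName is hypothetical and not part of wv.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not wv code): translate a charset name that glibc's
// iconv may not know into one that it does. Only the cp950 case from this
// thread is handled; every other name is returned unchanged.
std::string glibcCharsetName(const std::string& name)
{
    // Compare case-insensitively by lowercasing a copy.
    std::string lower(name);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (lower == "cp950")
        return "big5";
    return name;
}
```

The mapped name would then be passed as the source charset when opening the converter.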

>
> Dom
>
>
> >From: Vlad Harchev <hvv@hippo.ru>
> >To: Sam TH <sam@uchicago.edu>
> >CC: Chih-Wei Huang <cwhuang@linux.org.tw>, abiword-dev@abisource.com,
> > Dom Lachowicz <cinamod@hotmail.com>
> >Subject: Re: MSWord DOC (fwd)
> >Date: Sat, 11 Nov 2000 22:09:58 +0400 (SAMT)
> >
> >On Sat, 11 Nov 2000, Sam TH wrote:
> >
> > > On Sun, Nov 12, 2000 at 01:34:26AM +0800, Chih-Wei Huang wrote:
> > > > Sam TH wrote:
> > > > > Vlad, what exactly is this document supposed to look like? When I run
> > > > > it through wv here, I get a document with about 7 CJK characters. Is
> > > > > that anything like correct? What's the desired result?
> > > >
> > > > If you import it correctly, it should look like this
> > > > http://cle.linux.org.tw/images/screen_shot/abiword-4.jpg
> > > >
> > > > That is, two Chinese characters, plus 'word',
> > > > followed by another two Chinese characters.
> > > >
> > >
> > > I get, when using wv, ??####?? where the #s are Chinese characters (but
> > > not the ones in your screen shot) and the ?s are actual question marks.
> >
> > If you have a new glibc (2.1.96 or above), then the #s are byte-swapped
> >ASCII characters (i.e. 'w'<<8, 'o'<<8, 'r'<<8, 'd'<<8) and the "?"s are
> >something even funnier, I guess.
>
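
To illustrate the byte-swapping point quoted above: shifting the ASCII codes of "word" left by 8 bits gives 0x7700, 0x6F00, 0x7200 and 0x6400, and all four fall inside the CJK Unified Ideographs block (U+4E00 to U+9FFF), which is why they show up as Chinese characters. A minimal sketch, with hypothetical helper names:

```cpp
#include <cstdint>

// Swap the two bytes of a 16-bit value (0x0077 becomes 0x7700).
constexpr std::uint16_t swapBytes(std::uint16_t v)
{
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// True if the value lies in the CJK Unified Ideographs block.
constexpr bool isCjkUnifiedIdeograph(std::uint16_t v)
{
    return v >= 0x4E00 && v <= 0x9FFF;
}

// Compile-time checks for the four letters of "word".
static_assert(swapBytes('w') == 0x7700, "'w' << 8");
static_assert(isCjkUnifiedIdeograph(swapBytes('w')), "renders as CJK");
static_assert(isCjkUnifiedIdeograph(swapBytes('o')), "renders as CJK");
static_assert(isCjkUnifiedIdeograph(swapBytes('r')), "renders as CJK");
static_assert(isCjkUnifiedIdeograph(swapBytes('d')), "renders as CJK");
```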

 Best regards,
  -Vlad


