Re: AbiWord Chinese version of Linux


Subject: Re: AbiWord Chinese version of Linux
From: Paul Rohr (paul@abisource.com)
Date: Thu Mar 30 2000 - 19:46:44 CST


At 10:30 AM 3/27/00 +0800, hj wrote:
> Top level window not support XIM. But s_ic and s_ic_attr must be static
>member. It will cause segment fault if I change to non-static. I don't know
>why.
> All Chinese and English Characters are encoded in unicode in abw.
>European languages are not encoded in unicode. In furture we display
>different languages in one document. So unicode encoding is needed.If you
>replace fonts.hj with european languages, Characters are unicode in abw.
> Chinese font files are too large to ship. I don't distribute Chinese
>fonts. I create a file "fonts.hj" in AbiWord font file that include Chinese
>printing font name, XLFD, printing font ascent, printing font descent and
>printing font width.
> All unixfonts are created as fontset not font. It can display both
>English and Chinese Character. Printing program can print both English and
>Chinese Character.
> We must resolve that keyval will be 0xffffff when I input Chinese with
>XIM. Chinese strings are stored in string not in keyval.

Thanks for the patch. I'm very, very impressed by how you've tackled issues
throughout the tree to get Chinese working for you on Linux. My goal now is
to figure out how to integrate the work you've done with the work that will
be needed to add true Unicode support for other languages and/or platforms.

At this point, I'd like feedback from other developers in the following two
areas:

  - people working on related i18n issues (Henrik Berg, Vadim Frolov)
  - a random GTK expert or two

As soon as we've got some consensus that you all are heading in the same
direction, we can start getting some or all of this code checked in.

To get the discussion rolling, here are some observations (in no particular
order):

0. do you have a screen shot?
------------------------------
I'd totally love to *see* your version running.

1. UI translation
------------------
It's really cool to see that you've already translated most of the UI. I'm
presuming that the hex-encoded characters map directly to the appropriate
Unicode characters, and not some other charset, right?

  src/wp/ap/xp/ap_Menu_LabelSet_Languages.h
  src/wp/ap/xp/ap_Menu_LabelSet_ZhCN.h
  src/wp/ap/xp/ap_TB_LabelSet_Languages.h
  src/wp/ap/xp/ap_TB_LabelSet_ZhCN.h
  user/wp/strings/ZhCN.strings

How bad was it to do all the editing to generate an 8859-1 encoding of the
strings file? Would it have been easier for you to use one of expat's other
supported encodings instead?

  http://www.jclark.com/xml/expatfaq.html

For example, you can directly export UTF-8 files from AbiWord. :-)

2. XIM on frame
----------------
Thanks for digging out the GTK APIs for XIM support. Is there anything we'd
need to know to make these changes work for other languages besides Chinese?

  src/af/xap/unix/xap_UnixFrame.cpp
  src/af/xap/unix/xap_UnixFrame.h

Also, could you elaborate on what problems you were seeing with non-static
ICs? Perhaps someone else on the list might be able to help.

3. coding style
----------------
It looks like there are a number of places where you added files and/or
functions, all of which had your initials as a prefix. Do you want your
code to stand out like this, or was that just to make it easier to read the
patch?

(We generally try to write code so that it all blends together. That way,
you have to use Bonsai's cvsblame tool to see who was responsible for a
given line of code.)

4. files to ignore
-------------------
I noticed that there were a bunch of files in your patch which included
changes which probably shouldn't be checked in. For example,

  src/af/xap/Makefile
  src/af/xap/unix/xap_UnixDlg_About.cpp

In addition, a bunch of spurious diffs were generated by RCS_ID variations.
(Does anyone know of an option to suppress these?)

5. some languages don't ever get spell-checked
-----------------------------------------------
I also noticed that you've implemented quick hacks to avoid spell-checking
Chinese content.

  src/text/fmt/xp/fl_BlockLayout.cpp
  src/wp/ap/xp/ap_Dialog_Spell.cpp

Is there a more general way to do this check? Do we want to explicitly tag
content by language (via the lang attribute), or will it be enough to just
ignore certain Unicode ranges?
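
To make the second option concrete, here is a rough sketch of what a
range-based check might look like. The function name and the particular
ranges are my own assumptions for illustration; nothing like this exists in
the patch or the tree, and the right list of ranges would need input from
people who actually write these languages.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the AbiWord tree: returns true when a
// 16-bit character falls in a block that is normally written without
// spaces between words, so a word-list spell checker has nothing to check.
bool skipSpellCheck(std::uint16_t ch)
{
    return (ch >= 0x3000 && ch <= 0x303F)   // CJK Symbols and Punctuation
        || (ch >= 0x3040 && ch <= 0x30FF)   // Hiragana and Katakana
        || (ch >= 0x4E00 && ch <= 0x9FFF)   // CJK Unified Ideographs
        || (ch >= 0xFF00 && ch <= 0xFFEF);  // Halfwidth and Fullwidth Forms
}
```

A check like this needs no markup in the document, but it can't tell (say)
French from English, which is where an explicit lang attribute would still
be needed.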

6. pairing unrelated fonts
---------------------------
This one's going to sound pretty ignorant, so please forgive me.

I'm not sure I completely understand why you've implemented the logic to
pair up English and Chinese fonts as if they were the same font (as far as
the UI is concerned).

  src/af/xap/unix/xap_UnixFont.cpp
  src/af/xap/unix/xap_UnixFont.h
  src/af/xap/unix/xap_UnixFontManager.cpp
  src/af/xap/unix/xap_UnixFontManager.h
  src/af/xap/unix/xap_UnixPSGraphics.cpp
  src/af/xap/unix/xap_UnixPSGraphics.h

I'm used to using WYSIWYG editors, where users choose to use one font at a
time, switching to others as needed. Any time you use a character which
isn't provided in that font, you get a slug character.

From what little I know of fontsets, the idea is that you explicitly
assemble a collection of overlapping fonts and give that *set* of fonts a
name. IIRC, GTK has mechanisms to do this, but I'm not sure whether that
helps you much, since you have to generate PS output, too.

(It's bad enough to do a 1-to-1 WYSIWYG mapping between screen fonts and
printer fonts. Mapping collections of fontsets sounds like a nightmare.)
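
Just to check my own understanding of the difference, here is a toy model
of the two behaviors. Every name in it is made up, each font is pretended
to cover a single contiguous range, and it has nothing to do with the real
X11 or GTK fontset calls.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// All names here are hypothetical; this is a toy model, not X11 or GTK code.
struct ToyFont {
    std::string name;
    std::uint16_t first;   // first character covered
    std::uint16_t last;    // last character covered
    bool covers(std::uint16_t ch) const { return ch >= first && ch <= last; }
};

// A fontset is an ordered list; the first member covering the character wins.
// Returns nullptr when no member covers it, i.e. the caller draws a slug.
const ToyFont* pickFont(const std::vector<ToyFont>& fontset, std::uint16_t ch)
{
    for (const ToyFont& f : fontset)
        if (f.covers(ch)) return &f;
    return nullptr;
}
```

With a one-element list this is the single-font case I'm used to, where
anything outside the font comes back as a slug. With two elements it is
the paired English and Chinese case, and the printing side would need the
same lookup to stay WYSIWYG.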

Again, my goal here is to understand how to take what you've done and use it
to solve similar problems for other languages.

7. multibyte / wide character conversions
------------------------------------------
I suspect that this stuff is likely to be the most controversial. There are
a number of places in the code where you've introduced locale-specific
variants of UCS <--> char conversions via mbtowc() and wctomb().

  mbtowc
  ------
  src/af/ev/unix/ev_UnixKeyboard.cpp

  wctomb
  ------
  src/af/gr/unix/gr_UnixGraphics.cpp

  UCS <--> char (via wc/mb)
  -------------
  src/af/util/Makefile
  src/af/util/xp/Makefile
  src/af/util/xp/hj.cpp
  src/af/util/xp/hj.h
  src/af/util/xp/hj_mbtowc.cpp
  src/af/util/xp/hj_mbtowc.h
  src/af/util/xp/hj_wctomb.cpp
  src/af/util/xp/hj_wctomb.h

  src/text/fmt/xp/fp_TextRun.cpp
  src/wp/ap/unix/ap_UnixDialog_Replace.cpp
  src/wp/ap/xp/ap_EditMethods.cpp

To be honest, I'm not sure how this approach compares to the iconv-oriented
stuff which Henrik and Vadim have been working on. I'm sure you're each
working on real problems, but I frankly don't understand enough about what
any of you are doing to be able to judge the merits of each approach.

Could the three of you start a discussion to help get ignorant Americans
like me up to speed? ;-)
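
As a starting point for that discussion, here is my understanding of what
an mbtowc()-based conversion does, as a minimal sketch. The helper name is
hypothetical and this is not the code from the patch. It also assumes that
the resulting wchar_t values can be treated as UCS, which the C standard
doesn't promise on every platform.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: decode a multibyte string using whatever charset
// the current C locale (LC_CTYPE) specifies. The same bytes can decode to
// different characters under a different locale.
std::vector<wchar_t> decodeWithLocale(const std::string& bytes)
{
    std::vector<wchar_t> out;
    std::mbtowc(nullptr, nullptr, 0);          // reset the shift state
    const char* p = bytes.data();
    std::size_t left = bytes.size();
    while (left > 0) {
        wchar_t wc = 0;
        int n = std::mbtowc(&wc, p, left);
        if (n < 0) break;                      // invalid sequence for this locale
        if (n == 0) n = 1;                     // embedded NUL: consumed one byte
        out.push_back(wc);
        p += n;
        left -= static_cast<std::size_t>(n);
    }
    return out;
}
```

The point I'd like explained is the comment at the top: the result depends
on the runtime locale, whereas (as I understand it) iconv lets the caller
name the source and target charsets explicitly.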

8. should plain text be anything other than ASCII?
---------------------------------------------------
On a similar note, it looks like you've extended a bunch of logic which
currently reads Latin-1 files to also handle other encodings, albeit in a
locale-specific way.

  src/af/xap/xp/xap_Strings.cpp
  src/wp/ap/xp/ap_Strings.cpp
  src/wp/impexp/xp/ie_exp_Text.cpp
  src/wp/impexp/xp/ie_imp_MsWord_97.cpp
  src/wp/impexp/xp/ie_imp_Text.cpp

This makes me kind of nervous, because it means that the actual contents of
the files being read and written are interpreted as being in different
charsets, depending on your locale settings at runtime.

Up until now, we've been striving to create totally-portable files, which
are always in the same encoding no matter where you read or write them.
(Thus, for example, note how we've differentiated 7-bit text files from UTF-8
text files.)
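
For contrast, here is a sketch of the kind of conversion I mean by
"totally portable". The helper name is made up and this is not the actual
exporter code. It assumes 16-bit characters and doesn't try to handle
surrogate pairs.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one 16-bit character.
// No locale is consulted, so the output bytes are identical everywhere.
void appendUtf8(std::string& out, std::uint16_t ch)
{
    if (ch < 0x80) {                            // 1 byte: plain 7-bit ASCII
        out += static_cast<char>(ch);
    } else if (ch < 0x800) {                    // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (ch >> 6));
        out += static_cast<char>(0x80 | (ch & 0x3F));
    } else {                                    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (ch >> 12));
        out += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (ch & 0x3F));
    }
}
```

Whatever locale the program runs under, the same character always produces
the same bytes, and a reader on another machine can decode them without
knowing anything about the writer's settings.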

bottom line
-----------
You've obviously put a lot of hard work into this patch, and I really really
want to be able to start bragging about the fact that we support Chinese on
at least one platform. That's *so* cool!

To be honest, I'm not sure that all of the issues I've mentioned above are
actually real. However, at the moment, I don't know enough to be able to
decide how much of this patch to integrate into the tree.

Could the various folks working on i18n issues help clear up some of my
confusion here?

Thanks,
Paul


