Re: CJK line breaking

From: Tomas Frydrych <tomasfrydrych_at_yahoo.co.uk>
Date: Tue Mar 15 2005 - 09:14:27 CET

Hi Roland,

>>(2) why is the westernCanBreak() called when cjkProp indicates break should
> not happen there;
>
> Consider this condition (executed if we're being asked about breaking after the
> character):
>
> if (cjkProp[1].cantEndLine || !cjkProp[1].cjk)
> return westernCanBreakAt(c[1], bAfter);
>
> First of all this says that if the current character is not a CJK character then
> we should use normal Western rules.
>
> Secondly, if the current character is one that cannot end a line (e.g. "{") then
> we also use the Western rules. The (expanded) Western rules say that you cannot
> break after a "{" and so the function returns false as it should.
>
> In fact the rules here in English and Chinese are the same. You can't end a line
> with "(,{,[" in English either, but because we have spaces we don't have to
> think about that explicitly. The above could equally be written:
>
> if (cjkProp[1].cantEndLine)
> return false;
> if (cjkProp[1].cjk)
> return westernCanBreakAt(c[1], bAfter)
>

Yes, that would be much better, both regarding clarity of the code and
efficiency: if we already know there should not be a break, we should
not make the unnecessary call to westernCanBreakAt(). When it comes to
the layout engine, efficiency matters a great deal. (I assume a '!'
slipped out of the second condition.)
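
To be explicit about what I mean, here is a rough sketch of the
reordered test with the '!' restored. The CJKProp struct, the Decision
enum, the function name earlyBreakAfterDecision() and the body of
westernCanBreakAt() are all stand-ins made up for illustration; the
real types and signatures in the patch may well differ.

```cpp
// Hypothetical stand-in for one entry of the cjkProp array.
struct CJKProp {
    bool cjk;          // the character is a CJK character
    bool cantEndLine;  // the character must not be last on a line
};

// Placeholder for the Western rules (assumption): a break is allowed
// only after a space.
bool westernCanBreakAt(char32_t c, bool bAfter) {
    return bAfter && c == U' ';
}

// Outcome of the early tests when asked about breaking after a character.
enum class Decision { No, Yes, ContinueWithCjkRules };

Decision earlyBreakAfterDecision(const CJKProp& cur, char32_t c) {
    // We already know the answer, so no call into the Western rules.
    if (cur.cantEndLine)
        return Decision::No;

    // Not a CJK character: the Western rules decide.
    if (!cur.cjk)
        return westernCanBreakAt(c, true) ? Decision::Yes : Decision::No;

    // A CJK character that may end a line: the remaining CJK checks
    // (not shown here) still have to run.
    return Decision::ContinueWithCjkRules;
}
```

The behaviour for a character like "{" is the same as before (no
break); the only difference is that the answer comes from the first
test instead of from a call to the Western rules.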

> In fact, originally I didn't modify the Western line breaking rules, but it
> turned out that the GR_Graphics::canBreak function wasn't being called as I
> expected and so it was necessary. However, I now realise that there is a
> significant amount of duplication between the tables
> canBreakBefore/canBreakAfter and cjkCantStartLine/cjkCantEndLine.
>
> Another issue is that the patch as it stands assumes that there are CJK
> languages and Western language and nothing else. In principle abi should be able
> to support any language so this assumption seems a little awkward.
> At the moment abi seems to assume that you can make the break/don't break
> decision based on a single character. That isn't even true in English.

Yes, the methods in GR_Graphics are fallbacks for platforms which lack
support for complex scripts, etc., or where that support is not
implemented yet. They are meant to provide reasonable behaviour, but not
all the bells and whistles; providing full Unicode compliance is no
small task, and the intention is to leave that to native libraries on
each platform. So on win32, we have GR_Win32USPGraphics::canBreak(),
which just asks the Uniscribe library whether a break is acceptable at
a given position (and the library considers more than a single
character). I am slowly working on a Pango graphics class for the *nix
version to do the same for us there. The system APIs on Mac provide the
needed functionality; someone just needs to implement the necessary
graphics class in due time.
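
Schematically, the arrangement looks something like the sketch below.
The class names GraphicsBase and ContextAwareGraphics, the canBreak()
signature and both break rules are invented for illustration; in
particular the context rule is a toy and is not what Uniscribe or Pango
actually do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the generic graphics class.
class GraphicsBase {
public:
    virtual ~GraphicsBase() = default;

    // Fallback: decide from the single character at pos.
    // Toy rule: a break is allowed only after a space.
    virtual bool canBreak(const std::u32string& text,
                          std::size_t pos) const {
        return pos < text.size() && text[pos] == U' ';
    }
};

// Hypothetical stand-in for a platform class. A real one would hand the
// text to the native library; this one just looks at the neighbour to
// show that the override is free to use context.
class ContextAwareGraphics : public GraphicsBase {
public:
    bool canBreak(const std::u32string& text,
                  std::size_t pos) const override {
        if (!GraphicsBase::canBreak(text, pos))
            return false;
        // Toy context rule: keep a run of spaces together by breaking
        // only after the last space of the run.
        bool nextIsSpace = pos + 1 < text.size() && text[pos + 1] == U' ';
        return !nextIsSpace;
    }
};
```

The layout code only ever talks to the base class interface, so it does
not need to know which of the two answers it is getting.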

> What I have in mind involves a function like the current charCJKProp() function
> which classifies every character into one of five classes:
>
> 1, atomic (Can stand alone like a Chinese character)
> 2, non-atomic (Generally forms part of a word like English let.)
> 3, can't end (Punctuation that can't end a line: { [ <)
> 4, can't start (Punctuation that can't start a line: , . ) } > ])
> 5, break (Punctuation that can break whatever: " ")
>
> Chinese characters can then be defined as atomic and, hopefully, similar
> characters in other languages can be as well once we know about them.
> Also, this way I can make the first test deal with spaces and non-atomic
> (like English) characters which should cover 80% of combinations that occur when
> typing in Western languages. That would address point 3 below.

It sounds like a very good approach, but keep in mind that it is only a
temporary measure (although temporary could be quite long). I intend to
get the Pango graphics ready for the 2.4 release, even though I am
making less progress at the moment than I would like due to my other
commitments, and once we also have complex script support
implemented on Mac, I would like to make all the GR_Graphics functions
that relate to complex scripts pure virtuals and remove all the related
character data from AbiWord.
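
For the interim, the five-class scheme could look roughly like the
sketch below. Everything in it is an assumption on my part: the names
BreakClass, classify() and canBreakBetween(), the handful of characters
in the table, and the exact pairing rules are only there to show the
shape, with the common Western case tested first.

```cpp
// The five proposed classes.
enum class BreakClass { Atomic, NonAtomic, CantEnd, CantStart, Break };

// Tiny illustrative classifier; a real table would cover far more.
BreakClass classify(char32_t c) {
    switch (c) {
    case U' ':
        return BreakClass::Break;
    case U'(': case U'[': case U'{': case U'<':
        return BreakClass::CantEnd;
    case U',': case U'.': case U')': case U']': case U'}': case U'>':
    case 0x3002:  // ideographic full stop
    case 0xFF0C:  // fullwidth comma
        return BreakClass::CantStart;
    default:
        break;
    }
    // CJK Unified Ideographs block.
    if (c >= 0x4E00 && c <= 0x9FFF)
        return BreakClass::Atomic;
    return BreakClass::NonAtomic;
}

// May a line break fall between prev and next?
bool canBreakBetween(char32_t prev, char32_t next) {
    const BreakClass a = classify(prev);
    const BreakClass b = classify(next);

    // Fast path: two word characters, the usual case in Western text.
    if (a == BreakClass::NonAtomic && b == BreakClass::NonAtomic)
        return false;
    // Never strand an opener at line end or a closer at line start.
    if (a == BreakClass::CantEnd || b == BreakClass::CantStart)
        return false;
    // A space allows a break after it.
    if (a == BreakClass::Break)
        return true;
    // Otherwise a break needs an atomic character on one side.
    return a == BreakClass::Atomic || b == BreakClass::Atomic;
}
```

Note that this looks at a pair of characters rather than a single one,
which is already a step beyond what the current fallback does.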

Tomas
Received on Tue Mar 15 09:15:17 2005
