CJK line breaking

From: Roland Kay <roland.kay_at_ox.compsoc.net>
Date: Tue Mar 01 2005 - 01:27:30 CET

Hi,

I'm new to this list so I'd better briefly introduce myself. I'm a teacher in
Beijing who uses Abiword for writing both English and Chinese documents.
Abi is a great word processor for English, but it has quite a lot of problems
dealing with multi-lingual documents. Since I have an interest in this I've been
trying to fix some of the CJK-related bugs (#8500, #8499, #8468). At the moment
I'm taking a look at the Chinese line breaking rules.

Currently there are two problems with this:

(a) CJK line breaking is only enabled if the user is in a CJK locale. This means
that users in the UK, for instance, who want to write in Chinese have a problem.

(b) The current line breaking rule breaks between every character. This is a good
approximation, but it's not quite right. For example, you can never start a line
with a comma, so if the first character on the new line would be a "，" then you
need to bring the preceding character down as well.

(a) is easy to fix. (b) seems to require knowing the following character when
deciding whether we can break after a character. E.g. we can break
after "wo" (我) unless the next character is a "，".
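
To make the rule in (b) concrete, here is a minimal standalone sketch of a break decision that looks at both the current character and the next one. It does not use the AbiWord API at all. The helper names isCJK(), cannotStartLine() and canBreakAfter() are hypothetical. The ranges follow the Unicode block boundaries, which differ slightly from the values in my patch, and the only "cannot start a line" character handled is the fullwidth comma (U+FF0C). Non-CJK text is out of scope here and simply returns false; in the real code it would fall through to the existing rules.

```cpp
// Hypothetical helper, not part of AbiWord: true for code points in a few
// CJK ideograph blocks (bounds are inclusive).
static bool isCJK(char32_t c)
{
    return (c >= 0x3400  && c <= 0x4DBF)    // CJK Unified Ideographs Extension A
        || (c >= 0x4E00  && c <= 0x9FFF)    // CJK Unified Ideographs
        || (c >= 0xF900  && c <= 0xFAFF)    // CJK Compatibility Ideographs
        || (c >= 0x20000 && c <= 0x2A6DF);  // CJK Unified Ideographs Extension B
}

// Hypothetical helper: characters that must never begin a line.
// Only the fullwidth comma is listed; a real table would be longer.
static bool cannotStartLine(char32_t c)
{
    return c == 0xFF0C; // FULLWIDTH COMMA
}

// Can we break the line between c and next?
static bool canBreakAfter(char32_t c, char32_t next)
{
    // Not CJK text: this sketch does not decide, so report "no break".
    if (!isCJK(c) && !cannotStartLine(c))
        return false;

    // CJK text breaks between any two characters, except where the break
    // would leave a forbidden character at the start of the new line.
    return !cannotStartLine(next);
}
```

With this, canBreakAfter(0x6211, 0xFF0C) is false (我 followed by a comma), while canBreakAfter(0x6211, 0x4F60) is true (我 followed by 你). Unlike the patch below, it also refuses to break between two consecutive commas.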

I'm trying to put together a demo patch to show how this would work. I'm not
very familiar with the back end though so I was wondering if someone could give
me a quick tip:

What is the best way in GR_Graphics::canBreak() to get the value of the
character in the document after the current one?

For example, the piece of code below doesn't seem to work. The second call to
getChar() always returns UT_IT_ERROR.

Note, this is not beautiful code; I'm just playing around at the moment.

Any ideas would be greatly appreciated.

Best wishes,

Roland.

>8-------------------------------------------------------------------8<

bool GR_Graphics::canBreak(GR_RenderInfo & ri, UT_sint32 &iNext, bool /* bAfter */)
{
        iNext = -1; // we do not bother with this
        UT_return_val_if_fail(ri.m_pText && ri.m_pText->getStatus() == UTIter_OK, false);

        *(ri.m_pText) += ri.m_iOffset;
        UT_return_val_if_fail(ri.m_pText->getStatus() == UTIter_OK, false);

        /*
         * For CJK we need to consider both this character and the next one.
         */
        UT_UCS4Char c = ri.m_pText->getChar();
        UT_uint32 iPos = ri.m_pText->getPosition();
        ri.m_pText->setPosition(iPos+1);
        UT_UCS4Char c2 = ri.m_pText->getChar();
        ri.m_pText->setPosition(iPos);
        UT_DEBUGMSG(("canBreak: char1: %x, char2: %x\n",c,c2));

        // Is this a CJK character? (Note these values may be incorrect)
        // Bounds are inclusive so the first and last code point of each range match.
        if ((c>=0x3400 && c<=0x4dbf) || (c>=0x4e00 && c<=0x9faf) ||
            (c>=0xf900 && c<=0xfaff) || (c>=0xfe30 && c<=0xfe4f) ||
            (c>=0x20000 && c<=0x2a6df) || c==0xff0c)
        {
                if (c!=0xff0c && c2==0xff0c)
                        return false;
                return true;
        }

        UT_return_val_if_fail(getApp(), false);
        return getApp()->getEncodingManager()->can_break_at(c);
}
Received on Tue Mar 1 01:32:29 2005