BiDi and script support


Subject: BiDi and script support
From: Tomas Frydrych (tomas@frydrych.freeserve.co.uk)
Date: Mon Apr 03 2000 - 15:35:17 CDT


Thanks to Joaquín for pointing me to Pango.

I would like to propose an outline of an approach to implementing support for
BiDirectional documents and scripts in AbiWord. Please let me know what
you think. (I apologise in advance for this being so wordy; I certainly do not
plan to make a habit of it...)

Why Bi-directional support and support for scripted languages?
--------------------------------------------------------------
Currently there is no word-processing application, commercial or free, on any
platform, that offers decent and flexible support for creating documents that
require these capabilities; at least I am not aware of any, and I have been
looking for one for a while :-(. Flexibility is the key term here: it must be
possible for the user to add support for new languages in a simple and
non-technical manner, so that even small ethnic groups or academics with an
unusual field of work could use it (the approach suggested below would ensure
that). There is a real need for such an application, but the usual commercial
developers seem to be ignorant of this need, or simply lack the (financial?)
incentives.

Bi-Directional support (BiDiS) - basic concepts
---------------------------------------------
(1) BiDiS makes no impact on the doc part of the doc/view model. All text is
stored sequentially in the natural order of entry, i.e., left-to-right (ltr)
text is stored from left to right and right-to-left (rtl) text is stored from
right to left. (2) On the view level a decision is made for each chunk of text
whether it is to be displayed from left to right or from right to left. This
can be done either (a) using an explicit text-formatting attribute, or (b) by
associating each font internally with a direction of writing; (b) is the
better option, as it is more transparent to the user and thus easier to use.
A new attribute, which we can call the 'prevailing direction of text', will
characterise each block. It can be either user-defined, or derived from
the default font of the paragraph (if there is such a thing), or from the font
used with the 1st character of the paragraph. Again, I think one of the
latter two methods is preferable, as it is transparent to the user and also
requires no changes to the underlying XML set of attributes.
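
The following is only a minimal sketch of option (b) and of deriving the
prevailing direction, not AbiWord code. The font names, the FONT_DIRECTION
table and both function names are assumptions made up for illustration.

```python
# Hypothetical table associating each font with a direction of writing.
# The font names are invented for this example.
FONT_DIRECTION = {"Times": "ltr", "HebrewSans": "rtl"}


def run_direction(font):
    """Direction of a chunk of text, taken from its font (defaults to ltr)."""
    return FONT_DIRECTION.get(font, "ltr")


def prevailing_direction(runs, default_font=None):
    """Prevailing direction of a paragraph given as (font, text) runs.

    Uses the paragraph's default font if there is one, otherwise the font
    of the first character of the paragraph.
    """
    if default_font is not None:
        return run_direction(default_font)
    for font, text in runs:
        if text:  # first non-empty run holds the first character
            return run_direction(font)
    return "ltr"


paragraph = [("HebrewSans", "abc "), ("Times", "AbiWord"), ("HebrewSans", " def")]
print(prevailing_direction(paragraph))           # rtl
print(prevailing_direction(paragraph, "Times"))  # ltr
```

Neither function needs a direction attribute stored with the text, which is
the point made above about leaving the XML attribute set unchanged.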

The messy part is soft line breaking in mixed-direction paragraphs. Within
each paragraph, special attention needs to be given to any chunk of text
whose direction of writing differs from the prevailing one if a line
break falls within it. (Use could be made of the FriBidi library, but it might
be more practical to write the necessary algorithms from scratch and to tailor
them to the Abi framework.)
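
To illustrate why a break inside an opposite-direction chunk is awkward, here
is a rough sketch, assuming a deliberately simplified model: lines are broken
by character count, each run carries an explicit direction, and neutral
characters, numbers and nested embeddings are ignored. It is not the full
Unicode bidirectional algorithm, and break_lines and visual_line are
hypothetical names. Upper-case letters stand in for rtl text.

```python
from itertools import groupby


def break_lines(runs, width):
    """Split a paragraph into lines of at most `width` characters.

    `runs` is a list of (direction, text) pairs in logical (entry) order.
    Breaking is done on the logical sequence, so a break may fall inside
    a run; each line is returned as a list of (direction, text) pieces.
    """
    chars = [(d, c) for d, text in runs for c in text]
    lines = []
    for start in range(0, len(chars), width):
        chunk = chars[start:start + width]
        lines.append([(d, "".join(c for _, c in grp))
                      for d, grp in groupby(chunk, key=lambda dc: dc[0])])
    return lines


def visual_line(pieces, base):
    """Reorder the pieces of one line for display, left to right on screen."""
    if base == "rtl":
        pieces = reversed(pieces)  # pieces are laid out from the right
    return "".join(t if d == "ltr" else t[::-1] for d, t in pieces)


runs = [("ltr", "see "), ("rtl", "ABCDEF"), ("ltr", " now")]
for pieces in break_lines(runs, 7):
    print(visual_line(pieces, "ltr"))
# see CBA
# FED now
```

The break has to be chosen on the logical text and the reversal applied to
each line separately. Reversing the whole rtl run first and breaking
afterwards would put "FED" on the first line, i.e. the end of the rtl chunk
would be read before its beginning.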

In addition to the BiDi support in the view, exporters into formats that do not
support rtl text (most, if not all) should ideally be modified so that all rtl
text is reversed in the output format, and any soft line break that falls
within rtl text is replaced with a hard break. This makes the text behave as
ltr; otherwise the non-rtl-aware application will make a real mess of it.
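
As a rough sketch of such an exporter step, assuming a paragraph that is
entirely rtl, a fixed line width in characters, and a target format whose
hard break is a newline (all three are assumptions for illustration, as is
the function name):

```python
import textwrap

HARD_BREAK = "\n"  # assumption: the target format's hard line break


def export_rtl_paragraph(text, width):
    """Turn a logical-order rtl paragraph into text that displays as ltr.

    The paragraph is wrapped in logical order first, then each line is
    reversed, and the soft breaks are frozen into hard breaks so that the
    importing application cannot re-wrap the reversed text.
    Note: reversing character by character is a simplification; it would
    misplace combining marks, which a real exporter must keep attached.
    """
    lines = textwrap.wrap(text, width)
    return HARD_BREAK.join(line[::-1] for line in lines)


print(export_rtl_paragraph("ABC DEF GHI", 7))
# FED CBA
# IHG
```

If the soft breaks were left in place, the importing application would
re-wrap the reversed paragraph and the last words of the original text could
end up on the first line.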

Scripted alphabets - suggested approach
---------------------------------------

By 'scripted alphabet', or simply a script, I mean an alphabet where each
letter has several different, context-sensitive forms. Typically, there
are four forms: initial, medial, final and independent, although some letters
may not have the full set. I suggest the following approach:

(1) Each letter is represented in the doc by a single character irrespective
of the context. I shall call this representation the 'underlying string'. The
appearance is handled by a rendering engine built into the view, which will
translate the character to the glyph in the font which is contextually
correct. I will call this representation of the text the 'surface string'. The
underlying string is handled by the doc, the surface string by the view.
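
A toy illustration of the underlying/surface distinction, assuming an
invented three-letter script in which 'a' lacks initial and medial forms and
therefore never joins to the letter after it. The FORMS table, the glyph
names and the function names are hypothetical; a real engine would map to
glyphs in a font.

```python
# Invented script: glyph names for each letter and position.
FORMS = {
    "b": {"ini": "b.ini", "med": "b.med", "fin": "b.fin", "iso": "b.iso"},
    "t": {"ini": "t.ini", "med": "t.med", "fin": "t.fin", "iso": "t.iso"},
    "a": {"fin": "a.fin", "iso": "a.iso"},  # no initial or medial form
}


def joins_next(ch):
    """True if the letter can connect to the letter that follows it."""
    return "ini" in FORMS.get(ch, {})


def to_surface(underlying):
    """Translate an underlying string into a list of contextual glyphs."""
    glyphs = []
    for i, ch in enumerate(underlying):
        if ch not in FORMS:  # spaces, punctuation: passed through as-is
            glyphs.append(ch)
            continue
        joined_prev = i > 0 and joins_next(underlying[i - 1])
        joined_next = (joins_next(ch) and i + 1 < len(underlying)
                       and underlying[i + 1] in FORMS)
        if joined_prev and joined_next:
            form = "med"
        elif joined_prev:
            form = "fin"
        elif joined_next:
            form = "ini"
        else:
            form = "iso"
        glyphs.append(FORMS[ch][form])
    return glyphs


print(to_surface("tbt"))  # ['t.ini', 'b.med', 't.fin']
print(to_surface("tab"))  # ['t.ini', 'a.fin', 'b.iso']
```

The doc only ever sees "tbt" or "tab"; the choice of form is recomputed by
the view whenever a neighbouring character changes.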

(2) An excellent system of rendering for scripts along these lines has already
been developed by the Summer Institute of Linguistics (SIL, www.sil.org). Each
script/font is associated with a 'script definition file', SDF. This is a
plain text file containing rules for translation of underlying strings to
surface strings (and vice versa). The API of the engine contains functions
which translate the underlying string to its surface form (or vice versa),
and perform a number of other functions, such as calculating the position in
the surface string that corresponds to a position in the underlying string (the
two strings need not be, and often are not, of the same length).
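
The sketch below shows the general idea of rule-driven translation with a
position map. It is not the SIL engine's API and does not use the SDF file
format, neither of which is described here; the rule table, the function
name and the longest-match policy are all assumptions for illustration.

```python
def translate(underlying, rules):
    """Apply rewrite rules and return (surface, starts).

    `rules` maps non-empty underlying substrings to surface substrings.
    At each position the longest matching rule wins; characters matched by
    no rule are copied unchanged. `starts[i]` is the index in the surface
    string where the output for underlying position i begins. It has
    len(underlying) + 1 entries, so the end-of-text position is mapped too.
    """
    keys = sorted(rules, key=len, reverse=True)
    out, starts = [], []
    i = pos = 0
    while i < len(underlying):
        for key in keys:
            if underlying.startswith(key, i):
                replacement, consumed = rules[key], len(key)
                break
        else:  # no rule matched at this position
            replacement, consumed = underlying[i], 1
        starts.extend([pos] * consumed)
        out.append(replacement)
        pos += len(replacement)
        i += consumed
    starts.append(pos)
    return "".join(out), starts


surface, starts = translate("shax", {"sh": "S", "x": "ks"})
print(surface)  # Saks
print(starts)   # [0, 0, 1, 2, 4]
```

A caret sitting before underlying position 3 (the 'x') would be drawn before
surface position starts[3] == 2, which is the kind of lookup the view needs
when the two strings differ in length.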

I suggest using this engine for the following reasons: (a) the engine was
developed by linguists for linguists and it really shows in the design; (b) the
engine is very fast: it is capable of handling SDF files containing several
thousand rules even on a slower machine, which allows it to be used even with
languages which require fonts with several thousand glyphs; (c) the
engine has sophisticated 'wild-card' capabilities that allow handling of complex
scripts very effectively -- I believe that it is the best piece of code of its
type currently available; (d) there are already a number of SDF files available
for a variety of languages/fonts and, in addition, SDF files are extremely easy
to create; SIL even has a free GUI editor to facilitate this; (e) the engine
fully supports both ltr and rtl languages and some of its API could be used
effectively in implementing BiDiS; (f) the engine supports Unicode both on
input and output, but it can equally be used with non-Unicode text; (g) the
engine has been around for several years and has been extensively tested.

The possible snag: the SIL engine is copyrighted, and it is not Open Source,
but it can be used and distributed freely. It exists in a Win form (a dll) and
I have obtained permission to port it to Linux (I have an alpha version of
the .so ready for more extensive testing). I am also fairly confident that
terms under which the source code could be released for porting to other
platforms could be negotiated; in my experience the people at SIL are very
reasonable. Further, SIL, which is a non-profit organisation, has been
developing a simple wordprocessor around the rendering engine, but the
project has been progressing slowly, mainly, I think, due to lack of human
resources -- they might have serious interest in cooperation with the Abi
project because AbiWord could fulfill the needs of their field linguists.

There are other rendering systems. A possible alternative mentioned by
Joaquín is Pango. It looks promising, but it is in an early stage of
development. Further, my main reservation about it is that you have to be a
programmer to implement support for a new language. This is where the SIL
approach beats all: it is extremely flexible and requires nothing but
knowledge of the language to extend it -- you could use SDF files even with
languages that do not require rendering, to implement things like autotext or
to redefine the keyboard to allow access to some special characters, all
entirely under the user's control. (In fact I am so impressed with the SIL
rendering engine that I think it would be worth writing a compatible
engine from scratch if its non-Open Source nature were a problem.)

What it would mean in more practical terms (incomplete, but the main stuff)
----------------------------------------------------------------------
* modification of the LineBreaker class to handle mixed-direction text
* a new class handling the association of fonts with SDF files; only one
  instance is needed/should exist per application, but its functionality has
  to be accessible from all relevant places
* adding a mechanism through which all text formatted with fonts requiring
  rendering will be rendered in the view and rtl runs of the text will be
  reversed before being passed on to the platform-specific code that outputs it
  on the screen, printer, etc.; modification of the mechanism that moves the
  caret around; this way the whole behaviour will be entirely transparent to
  the platform-specific code, which will not need to be modified in any way

