From: Tomas Frydrych (tomas@frydrych.uklinux.net)
Date: Sun Mar 17 2002 - 13:59:04 EST
This bug surfaced in the MS Word importer, but it will plague many
others. It has to do with handling the so-called mirror characters,
such as ( ) or [ ].
In Unicode, mirror characters are defined semantically. For
instance, character u0028 is defined as 'opening parenthesis'. In a
left-to-right context an opening parenthesis looks like '(', but in a
right-to-left context it looks like ')'. AW uses this semantic
definition, and it displays the correct glyph depending on the context.
A problem arises when we import documents from some other
format that does not use the semantic Unicode definition. For
instance, MS Word will not use u0028 for an opening parenthesis in
RTL context, but will instead store u0029 (which is 'closing
parenthesis') in its place. So, when we load the document and
analyse the context, we display the glyph for a closing parenthesis
in RTL context, which is '(', while the author intended ')'.
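To make the mismatch concrete, here is a minimal Python sketch of a
renderer that follows the semantic definition. It is only a model, not
AW code: the MIRROR table is a small illustrative subset and
displayed_glyph() is a hypothetical helper.

```python
import unicodedata

# Illustrative subset of mirror pairs (an assumption for this sketch);
# a complete table would come from the Unicode BidiMirroring.txt file.
MIRROR = {"(": ")", ")": "(", "[": "]", "]": "["}

def displayed_glyph(ch: str, rtl: bool) -> str:
    """Return the glyph shape a semantic renderer would show for ch."""
    # unicodedata.mirrored() reports the Bidi_Mirrored property.
    if rtl and unicodedata.mirrored(ch):
        return MIRROR.get(ch, ch)
    return ch

# Semantic storage: u0028 opens a parenthesis in either direction.
print(displayed_glyph("\u0028", rtl=False))  # (
print(displayed_glyph("\u0028", rtl=True))   # )

# Word-style storage: the author wanted an opening parenthesis in RTL
# text, but the file holds u0029, so the wrong shape is displayed.
print(displayed_glyph("\u0029", rtl=True))   # (
```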
This is a serious bug that needs some fix before the 1.0 release. I
see two possible avenues:
(1) The MS Word importer carries out the analysis of the context
and translates any mirror characters in RTL context to the
correct Unicode values. The problems with this are that (a) the
importer was not designed to analyse the context: it handles one
character at a time, and getting it to do this properly would not
be entirely simple, though it is surmountable; (b) we would have to
redo this in every importer that handles a file format with the
same problem (plain text, etc.).
(2) A second solution would be to add a method to our edit
methods, which would scan through the document for any mirror
characters in RTL context and replace them with their mirror
images. This method would be called by the importer once the
entire document is loaded. The main advantages of this are (a) it
can be used by any importer that needs it; (b) when the document
has been loaded, the context has already been analysed, so that it
is easy to identify the offending characters. The main disadvantage
is that the character-by-character scanning of the document and
the deletion/insertion operations carried out on the offending
characters will prolong document loading, which could be noticeable
on large files. Also, with the incremental loader in place, the initial
appearance of the document before the loading is completed would
be incorrect (unless we can call this fixing method while doing the
loading -- that should be possible, I think, if it is part of the
BlockLayout class rather than an independent edit method).
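
A rough Python sketch of what option (2) amounts to, assuming the
loaded document can be viewed as a list of runs whose direction has
already been resolved. The Run type, the MIRROR table and
fix_mirror_chars() are all hypothetical names for illustration; none
of this is the actual AW edit-method or BlockLayout interface.

```python
from dataclasses import dataclass

# Illustrative subset of mirror pairs (an assumption for this sketch).
MIRROR = str.maketrans("()[]{}<>", ")(][}{><")

@dataclass
class Run:
    text: str   # characters as stored by the importer
    rtl: bool   # direction already resolved during loading

def fix_mirror_chars(runs):
    """Swap mirror characters in RTL runs; leave LTR runs untouched."""
    return [Run(r.text.translate(MIRROR), r.rtl) if r.rtl else r
            for r in runs]

# A Word-style import: the RTL run stores visually chosen code points.
doc = [Run("(abc)", False), Run(")\u05d0\u05d1(", True)]
fixed = fix_mirror_chars(doc)
print([r.text for r in fixed])
```

The pass is its own inverse, so it must run exactly once per imported
document, and only for importers whose format has this problem.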
I would appreciate some comments on this, especially if someone
has a better idea of how to fix this.
Tomas