From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Mon Mar 18 2002 - 09:37:43 EST
--- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
>
> This bug surfaced in the MS Word importer, but it will plague many
> others. It has to do with handling the so-called mirror characters,
> such as ( ) or [ ].
>
> In Unicode, mirror characters are defined semantically. For instance,
> character u0028 is defined as 'opening parenthesis'. In left-to-right
> context an opening parenthesis looks like '(', but in right-to-left
> context it looks like ')'. AW uses this semantic definition, and it
> displays the correct glyph depending on the context.
>
> A problem arises when we import documents from some other format that
> does not use the semantic Unicode definition. For instance, MS Word
> will not use u0028 for opening parenthesis in RTL context, but instead
> it will store u0029 in its place (which is 'closing parenthesis'). So,
> when we load the document and analyse the context, we will display a
> glyph for closing parenthesis in RTL context, which is '(', while the
> author intended ')'.
>
> This is a serious bug that needs some fix before the 1.0 release. I
> see two possible avenues:
>
> (1) The MS Word importer carries out the analysis of the context and
> translates any mirroring characters in RTL context to the correct
> Unicode values. The problems with this are that (a) the importer was
> not designed to analyse the context; it handles a character at a time,
> and getting it to do this properly would not be entirely simple,
> though surmountable; and (b) we will have to redo this in every
> importer that handles a file format with the same problem (plain
> text, etc.).
>
> (2) A second solution would be to add a method to our edit methods
> which would scan through the document for any mirror characters in
> RTL context and replace them with their mirror images. This method
> would be called by the importer once the entire document is loaded.
> The main advantages of this are (a) it can be used by any importer
> that needs it; (b) when the document has been loaded, the context has
> already been analysed, so that it is easy to identify the offending
> characters. The main disadvantage is that the character-by-character
> scanning of the document and the deletion/insertion operations
> carried out on the offending characters will prolong the document
> loading, which could be noticeable on large files. Also, with the
> incremental loader in place, the initial appearance of the document
> before the loading is completed would be incorrect (unless we can
> call this fixing method while doing the loading -- that should be
> possible, I think, if it is part of the BlockLayout class rather than
> an independent edit method).
>
> I would appreciate some comments on this, especially if someone has a
> better idea of how to fix this.
I'm strongly in favour of solution 1 above. If the problem is just
with certain formats then we should fix only those formats, especially
when those formats otherwise support RTL documents. Otherwise we are
second-guessing which character the user wanted in which position in
which document, and that will surely lead to subtle bugs.
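
To make solution 1 concrete, here is a rough sketch of what an
importer-side fix might look like. It is not AbiWord code. It assumes
the importer can tell, for each run of text it reads, whether that run
is RTL (the in_rtl_run flag below is hypothetical), and it uses a small
hand-written table of mirror pairs for illustration only. The full list
is in Unicode's BidiMirroring.txt data file.

```python
import unicodedata

# Illustrative subset of mirror pairs (assumption: a real importer would
# load the complete table from Unicode's BidiMirroring.txt).
_PAIRS = "()[]{}<>"
MIRROR = {}
for i in range(0, len(_PAIRS), 2):
    opener, closer = _PAIRS[i], _PAIRS[i + 1]
    MIRROR[opener] = closer
    MIRROR[closer] = opener


def to_semantic(ch: str, in_rtl_run: bool) -> str:
    """Map one visually stored character to its semantic Unicode value.

    in_rtl_run is a hypothetical flag: True when the source format marks
    the enclosing run as right-to-left.
    """
    # Only characters with the Unicode Bidi_Mirrored property are swapped,
    # and only inside RTL runs; everything else passes through unchanged.
    if in_rtl_run and unicodedata.mirrored(ch):
        return MIRROR.get(ch, ch)
    return ch


def import_run(text: str, in_rtl_run: bool) -> str:
    """Convert a whole run, one character at a time."""
    return "".join(to_semantic(ch, in_rtl_run) for ch in text)
```

Because the swap works one character at a time, it would fit an
importer that handles a character at a time, provided the RTL flag is
available at that point. LTR runs are left untouched, so nothing is
guessed for formats that already store the semantic values.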
Andrew Dunbar.
=====
http://linguaphile.sourceforge.net http://www.abisource.com