Re: RFC: bug #2742

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Mon Mar 18 2002 - 09:37:43 EST

     --- Tomas Frydrych <tomas@frydrych.uklinux.net>
    wrote: >
    > This bug surfaced in the MS Word importer, but it
    > will plague many
    > others. It has to do with handling the so-called
    > mirror characters,
    > such as ( ) or [ ].
    >
    > In Unicode, mirror characters are defined
    > semantically. For
    > instance character u0028 is defined as 'opening
    > parenthesis'. In left-
    > to-right context an opening parenthesis looks like
    > '(', but in right-to-
    > left context it looks like ')'. AW uses this
    > semantic definition, and it
    > displays the correct glyph depending on the context.
    >
    > A problem arises when we import documents from some
    > other
    > format that does not use the semantic Unicode
    > definition. For
    > instance MS Word will not use u0028 for opening
    > parenthesis in
    > RTL context, but instead it will store u0029 in its
    > place (which is
    > 'closing parenthesis'). So, when we load the
    > document and analyse
    > the context, we will display a glyph for closing
    > parenthesis in RTL
    > context which is '(', while the author intended ')'.
    >
    > This is a serious bug that needs some fix before the
    > 1.0 release. I
    > see two possible avenues:
    >
    > (1) the MS Word importer carries out the analysis of
    > the context
    > and it translates any mirroring characters in RTL
    > context to the
    > correct Unicode values. The problem with this is
    > that (a) the
    > importer was not designed to analyse the context, it
    > handles a
    > character at a time, and getting it to do this properly
    > would not be
    > entirely simple, though surmountable; (b) we will have
    > to redo this in
    > every importer that handles a file format with the
    > same problem
    > (plain text, etc.).
    >
    > (2) A second solution would be to add a method to
    > our edit
    > methods, which would scan through the document for
    > any mirror
    > characters in RTL context and replace them with
    > their mirror
    > images. This method would be called by the importer
    > once the
    > entire document is loaded. The main advantages of
    > this are (a) it
    > can be used by any importer that needs it; (b) when
    > the document
    > has been loaded, the context has already been
    > analysed, so that it
    > is easy to identify the offending characters. The
    > main disadvantage
    > is that the character-by-character scanning of the
    > document and
    > the deletion/insertion operations carried on the
    > offending characters
    > will prolong the document loading, which could be
    > noticeable on
    > large files. Also, with the incremental loader in
    > place, the initial
    > appearance of the document before the loading is
    > completed would
    > be incorrect (unless we can call this fixing method
    > while doing the
    > loading -- that should be possible, I think, if it
    > is part of the
    > BlockLayout class rather than an independent edit
    > method).
    >
    > I would appreciate some comments on this, especially
    > if someone
    > has a better idea of how to fix this.

    I'm strongly in favour of solution 1 above. If the
    problem is just with certain formats then we should
    fix only those formats. Especially when those
    formats otherwise support RTL documents. Otherwise
    we are second-guessing which character the user
    wanted in which position in which document. And that will
    surely lead to subtle bugs.
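
    A rough illustration of what solution 1 amounts to, written in
    Python rather than as AbiWord code. Everything here is a
    hypothetical sketch: the function name fix_imported_text, the
    small MIRROR_PAIRS table, and the "nearest preceding strong
    character" rule for deciding RTL context are all assumptions.
    A real importer would more likely take the run direction from
    the file format itself, or use a full bidi analysis.

```python
import unicodedata

# Illustrative subset only; the complete list of mirrored pairs
# is published in Unicode's BidiMirroring.txt.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def fix_imported_text(text):
    """Swap mirror characters that sit in RTL context.

    Assumption: the source format stored the visual glyph (as the
    thread describes for MS Word), so in RTL context each mirror
    character must be replaced by its partner to obtain the
    semantic Unicode value.

    Context is approximated by the nearest preceding strong
    character; text is treated as LTR until one is seen.
    """
    out = []
    in_rtl = False
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):    # Hebrew, Arabic, etc.
            in_rtl = True
        elif bidi_class == "L":          # Latin and other LTR scripts
            in_rtl = False
        if in_rtl and ch in MIRROR_PAIRS:
            out.append(MIRROR_PAIRS[ch])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Two Hebrew letters, then a parenthesised Hebrew pair stored
    # with visually ordered parentheses.
    stored = "\u05d0\u05d1 )\u05d2\u05d3("
    fixed = fix_imported_text(stored)
    print([hex(ord(c)) for c in fixed])
```

    Because the swap happens only inside the importer for the
    affected format, documents from formats that already store
    semantic values are never touched.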

    Andrew Dunbar.

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com




    This archive was generated by hypermail 2.1.4 : Mon Mar 18 2002 - 09:37:40 EST