RFC: bug #2742

From: Tomas Frydrych (tomas@frydrych.uklinux.net)
Date: Sun Mar 17 2002 - 13:59:04 EST

    This bug surfaced in the MS Word importer, but it will plague many
    others. It has to do with handling the so-called mirror characters,
    such as ( ) or [ ].

    In Unicode, mirror characters are defined semantically. For
    instance, character U+0028 is defined as 'opening parenthesis'. In a
    left-to-right context an opening parenthesis looks like '(', but in a
    right-to-left context it looks like ')'. AW uses this semantic
    definition, and it displays the correct glyph depending on the context.

    A problem arises when we import documents from some other
    format that does not use the semantic Unicode definition. For
    instance, MS Word will not use U+0028 for an opening parenthesis in
    RTL context; instead it will store U+0029 in its place (which is
    'closing parenthesis'). So, when we load the document and analyse
    the context, we display the glyph for a closing parenthesis in RTL
    context, which is '(', while the author intended ')'.
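
    The mismatch can be illustrated with a small sketch. It is not AW
    code: the function name displayed_glyph and the MIRROR_PAIRS table
    are assumptions made for illustration, and the table covers only a
    few of the pairs listed in Unicode's BidiMirroring.txt. Only
    unicodedata.mirrored() comes from the Python standard library.

```python
import unicodedata

# Illustrative subset of mirror pairs (assumption: the full list is in
# Unicode's BidiMirroring.txt; this table is not taken from AW).
MIRROR_PAIRS = {"(": ")", ")": "(", "[": "]", "]": "[", "{": "}", "}": "{"}


def displayed_glyph(ch: str, rtl: bool) -> str:
    """Glyph a semantic renderer would draw for ch in the given context."""
    # unicodedata.mirrored() returns 1 for characters with the
    # Bidi_Mirrored property, 0 otherwise.
    if rtl and unicodedata.mirrored(ch):
        return MIRROR_PAIRS.get(ch, ch)
    return ch


semantic = "\u0028"     # 'opening parenthesis', as a semantic author stores it
word_stored = "\u0029"  # what the message says MS Word stores in RTL context

ltr_view = displayed_glyph(semantic, rtl=False)      # drawn as '('
rtl_view = displayed_glyph(semantic, rtl=True)       # drawn as ')'
imported_view = displayed_glyph(word_stored, rtl=True)  # drawn as '(', not ')'
```

    The last line is the bug: the imported U+0029 is mirrored a second
    time by the renderer, so the reader sees '(' where the author
    intended ')'.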

    This is a serious bug that needs a fix before the 1.0 release. I
    see two possible avenues:

    (1) The MS Word importer carries out the analysis of the context
    and translates any mirror characters in RTL context to the
    correct Unicode values. The problems with this are that (a) the
    importer was not designed to analyse the context; it handles one
    character at a time, and getting it to do this properly would not be
    entirely simple, though it is surmountable; and (b) we would have to
    redo this in every importer that handles a file format with the same
    problem (plain text, etc.).

    (2) A second solution would be to add a method to our edit
    methods that scans through the document for any mirror
    characters in RTL context and replaces them with their mirror
    images. This method would be called by the importer once the
    entire document is loaded. The main advantages of this are (a) it
    can be used by any importer that needs it; and (b) once the document
    has been loaded, the context has already been analysed, so it
    is easy to identify the offending characters. The main disadvantage
    is that the character-by-character scanning of the document and
    the deletion/insertion operations carried out on the offending
    characters will prolong document loading, which could be noticeable
    on large files. Also, with the incremental loader in place, the
    initial appearance of the document before loading is completed would
    be incorrect (unless we can call this fixing method while doing the
    loading -- that should be possible, I think, if it is part of the
    BlockLayout class rather than an independent edit method).
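
    A rough sketch of the post-load pass in (2), under simplifying
    assumptions: the document is modelled as a list of (text, is_rtl)
    runs whose direction has already been resolved, and only a handful
    of mirror pairs are handled. The names Run, SWAP and
    fix_mirrors_in_rtl_runs are hypothetical and do not correspond to
    any existing AW class or method.

```python
from typing import List, Tuple

# Hypothetical model of a loaded document: runs of text whose direction
# has already been resolved by the layout code. (text, is_rtl)
Run = Tuple[str, bool]

# One translation table swaps both members of each pair in a single pass,
# so a character is never swapped and then swapped back.
SWAP = str.maketrans("()[]{}", ")(][}{")


def fix_mirrors_in_rtl_runs(runs: List[Run]) -> List[Run]:
    """Replace mirror characters in RTL runs with their mirror images."""
    return [
        (text.translate(SWAP) if is_rtl else text, is_rtl)
        for text, is_rtl in runs
    ]


# Example: an LTR run is left alone; the RTL run, stored visually as in
# the MS Word case, has its parentheses swapped to the semantic values.
loaded = [("see (a)", False), ("\u05d0 )\u05d1(", True)]
fixed = fix_mirrors_in_rtl_runs(loaded)
```

    A real implementation would have to perform the replacement through
    the document's own edit operations rather than on plain strings,
    which is where the loading cost mentioned above comes from.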

    I would appreciate some comments on this, especially if someone
    has a better idea of how to fix this.

    Tomas


