From: Martin Sevior (msevior@mccubbin.ph.unimelb.edu.au)
Date: Mon Mar 18 2002 - 01:50:14 EST
On Sun, 17 Mar 2002, Tomas Frydrych wrote:
> 
> This bug surfaced in the MS Word importer, but it will plague many 
> others. It has to do with handling the so-called mirror characters, 
> such as ( ) or [ ].
> 
> In Unicode, mirror characters are defined semantically. For 
> instance character u0028 is defined as 'opening parenthesis'. In left-
> to-right context an opening parenthesis looks like '(', but in right-to-
> left context it looks like ')'. AW uses this semantic definition, and it 
> displays the correct glyph depending on the context.
> 
> A problem arises when we import documents from some other 
> format that does not use the semantic Unicode definition. For 
> instance MS Word will not use u0028 for opening parenthesis in 
> RTL context, but instead it will store u0029 in its place (which is 
> 'closing parenthesis'). So, when we load the document and analyse 
> the context, we will display a glyph for closing parenthesis in RTL 
> context which is '(', while the author intended ')'.
> 
OK I see the problem. Thanks.
> This is a serious bug that needs some fix before the 1.0 release. I 
> see two possible avenues:
> 
Well, of course everyone has their priorities. A higher priority for me
would be to ensure we get MS Word styles loaded correctly and to load MS
header/footer info into Abi.
> (1) the MS Word importer carries out the analysis of the context 
> and it translates any mirroring characters in RTL context to the 
> correct Unicode values. The problem with this is that (a) the 
> importer was not designed to analyse the context; it handles a 
> character at a time, and getting it to do this properly would not be 
> entirely simple, though it is surmountable; (b) we will have to redo this in 
> every importer that handles a file format with the same problem 
> (plain text, etc.).
> 
This is a pain.
> (2) A second solution would be to add a method to our edit 
> methods, which would scan through the document for any mirror 
> characters in RTL context and replace them with their mirror 
> images. This method would be called by the importer once the 
> entire document is loaded. The main advantages of this are (a) it 
> can be used by any importer that needs it; (b) when the document 
> has been loaded, the context has already been analysed, so that it 
> is easy to identify the offending characters. The main disadvantage 
> is that the character-by-character scanning of the document and 
> the deletion/insertion operations carried on the offending characters 
> will prolong the document loading, which could be noticeable on 
> large files. Also, with the incremental loader in place, the initial 
> appearance of the document before the loading is completed would 
> be incorrect (unless we can call this fixing method while doing the 
> loading -- that should be possible, I think, if it is part of the 
> BlockLayout class rather than an independent edit method).
> 
It should not be done in the formatter. We could do a scan at the
piecetable level just after the document has loaded. Make these changes
without throwing change records so we don't get into issues with
undo.
The piecetable already has the document properties, and you can keep track
of the text direction (left-to-right or right-to-left) by examining the
properties of the frags as you step through the document.
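
A rough sketch of such a piecetable-level pass, written in Python for brevity rather than in AbiWord's C++. The `Frag` class, its `rtl` flag, and `fix_mirrored` are hypothetical names for illustration only and are not the real piecetable API; the sketch assumes each frag exposes its direction as a property and that text can be rewritten in place without recording a change record.

```python
from dataclasses import dataclass

# Mirror pairs handled by this sketch. Unicode defines many more
# (every character with the Bidi_Mirrored property).
MIRROR = str.maketrans("()[]{}<>", ")(][}{><")


@dataclass
class Frag:
    """Hypothetical stand-in for a piecetable text fragment."""
    text: str
    rtl: bool  # direction read from the frag's properties


def fix_mirrored(frags):
    """Swap mirror characters in RTL frags in place; return how many frags changed.

    No change records are created, so the pass stays out of the undo history.
    """
    changed = 0
    for frag in frags:
        if not frag.rtl:
            continue  # LTR text already uses the semantic code points
        swapped = frag.text.translate(MIRROR)
        if swapped != frag.text:
            frag.text = swapped
            changed += 1
    return changed
```

A blind swap like this also flips a deliberately typed ")(", so it is only safe for text known to come from a format that stores the visual rather than the semantic character.
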
Just a caution though: suppose an author writing in Hebrew wanted to place a
")(" in his document. Would your algorithm detect this and not make the
change?
Also, is it possible to quickly detect during import whether a document
has any RTL text, so we don't have to scan ordinary docs?
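
One possible cheap check, again as a Python sketch rather than importer code: look at the Unicode bidirectional class of each character as it streams past and stop at the first strongly right-to-left one. The name `has_rtl` is hypothetical, and the sketch assumes direction can be inferred from the text alone; a format such as MS Word may also mark direction through run properties, which this does not cover.

```python
import unicodedata

# Bidi classes that indicate right-to-left content: strong RTL letters
# (R, AL) and the explicit RTL embedding/override/isolate controls.
RTL_CLASSES = {"R", "AL", "RLE", "RLO", "RLI"}


def has_rtl(chunks):
    """Return True as soon as any chunk of text contains an RTL character."""
    return any(
        unicodedata.bidirectional(ch) in RTL_CLASSES
        for chunk in chunks
        for ch in chunk
    )
```

Because `any` short-circuits, a document with RTL text near the start is detected almost immediately, while a purely LTR document costs one table lookup per character.
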
Cheers
Martin