Re: undo and combining characters

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Tue Apr 23 2002 - 00:03:43 EDT


    --- Paul Rohr <paul@abisource.com> wrote:
    > Again, thanks for the reference, Karl. Having
    > specific examples like this
    > really helps make the design discussion more
    > concrete.
    >
    > CAVEAT: I'm no expert on these issues, but I'm
    > trying to synthesize a
    > design principle which can be easily explained, so
    > the resulting behavior
    > will fall somewhere between the "Just Works" ideal
    > and a less ambitious "not
    > surprising" standard.
    >
    > At 03:28 PM 4/22/02 -0400, Karl Ove Hufthammer
    > wrote:
    > >Well, combining characters may be input in several
    > ways. On my
    > >Norwegian keyboard, I write é by pressing the Alt
    > Gr + 'the ´
    > >deadkey', followed by an e. (BTW, note that the
    > decomposed form of
    > >é in Unicode is e´, not ´e.) On French keyboards, I
    > believe there
    > >is a separate é key. But exactly how the keypress
    > --> character
    > >sequence is generated should be done by the OS.
    >
    > Agreed.

    Again, these are not combining characters - these are
    precomposed characters. I'm not aware of any system
    that returns a combining character sequence on
    pressing a single key or on using a dead key
    combination.
    An Arabic keyboard will return a sequence which
    represents a ligature however.
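
    To make the precomposed/combining distinction concrete, here
    is a minimal Python sketch using the standard unicodedata
    module. The variable names and the describe() helper are only
    for illustration; nothing here is AbiWord code.

```python
import unicodedata

# One code point: LATIN SMALL LETTER E WITH ACUTE (precomposed).
precomposed = "\u00e9"
# Two code points: base letter "e" followed by COMBINING ACUTE ACCENT.
# Note the order: the combining mark comes after its base character.
decomposed = "e\u0301"

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# NFC composes the sequence; NFD decomposes the precomposed character.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(describe(precomposed))
print(describe(decomposed))
print(nfc == precomposed, nfd == decomposed)
```

    Both strings should render as the same glyph, but only the
    second one contains a combining character.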

    > >As for undoing a decomposed character (e.g. e´), I
    > think it's safe
    > >to undo all characters back to (and including) the
    > last non-
    > >combining character. For example if you write e´
    > (where ´ is not
    > >actually ´, but the combining ´) and press undo,
    > both characters
    > >(which are probably displayed as one glyph) should
    > be deleted. (In
    > >practice é would/should be written as the
    > pre-composed é character,
    > >as per Normalization Form C <URL:
    > >http://www.unicode.org/unicode/reports/tr15/ >. I
    > only use it here
    > >as an example.)
    >
    > Given what you've said so far, here's a possible
    > design:
    >
    > Let the IME do as much composition work as it can.
    >
    > As far as AbiWord is concerned, the pre-composed
    > character is atomic.
    > That's what we store, render, select, format, and
    > undo.

    Agreed. Note that a CJK IME will usually return more
    than a single character. For example, entering
    "nihongo" with an English keyboard into a Japanese
    IME will result in the kanji sequence "ni+hon+go"
    which will be a string of 3 characters. This should
    be treated atomically by AbiWord.

    > So far, so good. That allows us to handle examples
    > A.1 and A.2, and happens
    > to be more-or-less what we've already got
    > implemented.
    >
    > >> What would a native speaker want to happen when
    > you "undo" the
    > >> entry of a single "on-screen" character?[1] I
    > suspect that
    > >> creating such an entity may take more than one
    > step (in the
    > >> input method editor), but should they always be
    > undone
    > >> individually?
    > >
    > >In case similar to my example above, yes. But not
    > always. See for
    > >example the romaji input example at <URL:
    > >http://www.w3.org/TR/charmod/#sec-CharExamples >.
    > How this should
    > >be handled is dependent on the actual input method
    > used.
    >
    > Do you have a specific proposal here?
    >
    > According to example A.3, as far as AbiWord is
    > concerned, all those Latin
    > characters are never seen. The IME intercepts and
    > translates them, handing
    > us 3 kanji characters at once.

    Correct.

    > Thus, the question becomes:
    >
    > - Are *those* characters atomic (for selection or
    > deletion purposes)?
    > - Should we glob them for undo purposes?

    They are not atomic for selection or deletion but they
    are for undo. Imagine it like pasting a string.
    Undo should unpaste the string.
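
    The "paste" analogy can be sketched as a toy undo stack in
    Python. The Document class and its method names are
    hypothetical, and the buffer only supports appending at the
    end; it is an illustration of the idea, not a description of
    the piece table.

```python
class Document:
    """Toy text buffer: every input event becomes exactly one undo step."""

    def __init__(self):
        self.text = ""
        self._undo = []  # length of each inserted string, newest last

    def insert(self, s):
        """Append whatever a single input event delivered (one or more chars)."""
        self.text += s
        self._undo.append(len(s))

    def undo(self):
        """Remove the most recent insertion as a whole; False if nothing to undo."""
        if not self._undo:
            return False
        n = self._undo.pop()
        self.text = self.text[:len(self.text) - n]
        return True


doc = Document()
doc.insert("a")                      # ordinary keypress: one character
doc.insert("\u65e5\u672c\u8a9e")     # IME commit for "nihongo": three kanji at once
second_kanji = doc.text[2]           # individual characters remain addressable
doc.undo()                           # removes all three kanji in one step
print(doc.text)
```

    The three kanji can still be selected or deleted one at a
    time; only undo treats the IME commit as a single unit.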

    > Again, I'm not a native speaker, but I'd guess that
    > the answer to the first
    > question is yes. The second is less clear to me.
    >
    > but wait, there's more
    > ----------------------
    > Now we get closer to the screw cases I was worried
    > about.
    >
    > For example, consider example A.4. Since we're
    > letting the IME do the
    > necessary composition (or decomposition), we have no
    > way to differentiate
    > the keystrokes used to create the two lam-alef
    > ligatures here. Thus, should
    > undo:

    Arabic doesn't use an IME. To enter lam-alef you have
    two options: hit the lam key, then hit the alef key;
    or hit the lam-alef key. For the former, all OSes will
    see two keypresses, each representing one character.
    For the latter, it may depend on the OS whether there
    is one event containing two charcodes or two separate
    events...

    > - glob the "first" one (for a total of 4 steps),
    > or
    > - decompose the second one (for a total of 6
    > steps)?

    I'd be surprised if any OS returned a precomposed
    glyph, but from memory the old Arabic encodings and
    Unicode both support a precomposed lam+alef, so
    maybe some OS does. The Arabic versions of Windows I
    have played with definitely produce two characters
    that I can delete individually.

    The easiest way to handle all of this is to not do
    any kind of normalizing ourselves for now. If the OS
    gives us a precomposed ligature, we treat it as a
    single character; if the OS gives us two characters
    which the renderer may combine into a ligature, we
    treat them as separate characters. This is probably
    what users are used to on their respective OSes
    now.
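
    As a rough Python illustration of the two cases, assuming
    the precomposed form in question is U+FEFB (the isolated-form
    lam-alef ligature from the Arabic Presentation Forms-B block;
    which code point an OS would actually deliver is a guess):

```python
import unicodedata

two_chars = "\u0644\u0627"  # ARABIC LETTER LAM, then ARABIC LETTER ALEF
ligature = "\ufefb"         # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

def store_as_received(s):
    """Keep the OS-supplied string untouched: no normalization at all."""
    return s

# Stored as received, the two inputs keep their different lengths.
print(len(store_as_received(two_chars)))  # two characters, deletable one by one
print(len(store_as_received(ligature)))   # one character

# Canonical normalization (NFC) leaves the ligature alone; only the
# compatibility form (NFKC) folds it into the two-letter sequence.
print(unicodedata.normalize("NFC", ligature) == ligature)
print(unicodedata.normalize("NFKC", ligature) == two_chars)
```

    So even if normalization to Form C is added later, the two
    inputs would still be stored differently; only a
    compatibility normalization would merge them.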

    > ( For anyone tempted to be tricky, yes we could
    > theoretically jigger the
    > undo records to differentiate them when originally
    > typed, but not after the
    > file's been stored and reloaded. Both behaviors
    > should be consistent, no? )

    Whether to normalize our files, or even normalize each
    line as the return key is hit, is probably worth
    thinking about at some point, but it is probably not
    essential yet...

    > Even worse, how many undo steps should there be
    > after typing the Tamil word
    > in example A.5?
    >
    > - six?
    > - five?
    > - four?
    > - one? (ie, punt and just don't allow
    > character-level undo)

    There should be one undo for each keypress. Re-render
    after each undo.

    > Don't ask me, I don't speak or type either language.
    > :-)
    >
    > bottom line
    > -----------
    > Like Martin, I've been hoping that Unicode was
    > monstrous enough that we
    > could always expect to encounter fully-composed
    > characters in the piece
    > table. That way, undo and selections would create a
    > user experience that
    > would Just Work like our trivial Latin cases.
    >
    > Evidently, it's not that simple, which is why we
    > need answers to these
    > questions.

    Correct ):

    > If my suggestions in the selection case make sense,
    > then is there any reason
    > to allow undo granularity which is *finer* than the
    > selection granularity?

    Undo should undo each input event (keypress or IME).
    Select should select what looks like a typable unit.
    So undo granularity would be finer than selection
    granularity in the case of a Vietnamese a with tone
    mark. Select would select both. Undo would undo
    first the tone mark, then the a.
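
    A small Python sketch of the difference between the two
    granularities. It assumes the input method delivers the base
    letter and the tone mark as two separate events, and it uses
    a deliberately simple rule for a "typable unit" (a base
    character plus any following characters with a non-zero
    canonical combining class). The selection_units() name is
    hypothetical, and the rule would not be sufficient for
    scripts such as Tamil, whose vowel signs have combining
    class 0.

```python
import unicodedata

def selection_units(text):
    """Group each base character with the combining marks that follow it."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch):
            units[-1] += ch      # attach the mark to the preceding unit
        else:
            units.append(ch)     # start a new unit
    return units

# Assumed input events: "a", then a combining acute tone mark, then "b".
events = ["a", "\u0301", "b"]
text = "".join(events)

print(len(events))                  # undo steps: one per input event
print(len(selection_units(text)))   # selection units: mark grouped with its base
```

    Here undo would take three steps while selection sees only
    two units, which is the "finer than selection" case described
    above.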

    > If so, when and how? What kind of user experience
    > will Just Work for the
    > cases Andrew is most worried about?

    We really need to attract a bunch of Asian, Indian,
    and Arabic language developers and/or testers!

    Andrew Dunbar.

    > OK folks, have at it!
    >
    > Paul

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com




    This archive was generated by hypermail 2.1.4 : Tue Apr 23 2002 - 00:05:46 EDT