Re: undo and combining characters

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Tue Apr 23 2002 - 00:03:43 EDT

Next message: Andrew Dunbar: "Hancom Office 2.01 for Linux"

Previous message: Andrew Dunbar: "Re: selections and combining characters"
In reply to: Paul Rohr: "Re: undo and combining characters"
Next in thread: Karl Ove Hufthammer: "Re: undo and combining characters"
Next in thread: Andrew Dunbar: "Re: undo and combining characters"
Reply: Karl Ove Hufthammer: "Re: undo and combining characters"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

--- Paul Rohr <paul@abisource.com> wrote:
> Again, thanks for the reference, Karl. Having
> specific examples like this
> really helps make the design discussion more
> concrete.
>
> CAVEAT: I'm no expert on these issues, but I'm
> trying to synthesize a
> design principle which can be easily explained, so
> the resulting behavior
> will fall somewhere between the "Just Works" ideal
> and a less ambitious "not
> surprising" standard.
>
> At 03:28 PM 4/22/02 -0400, Karl Ove Hufthammer
> wrote:
> >Well, combining characters may be input in several
> ways. On my
> >Norwegian keyboard, I write é by pressing the Alt
> Gr + 'the ´
> >deadkey', followed by an e. (BTW, note that the
> decomposed form of
> >é in Unicode is e´, not ´e.) On French keyboards, I
> believe there
> >is a separate é key. But exactly how the keypress
> --> character
> >sequence is generated should be done by the OS.
>
> Agreed.

Again, these are not combining characters - these are
precomposed characters. I'm not aware of any system
that returns a combining character sequence on
pressing a single key or on using a dead key
combination.
An Arabic keyboard, however, will return a sequence of
characters which the renderer displays as a ligature.
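
To make the distinction concrete, here is a minimal sketch, assuming
Python and its standard unicodedata module (neither is part of AbiWord;
the describe() helper is made up for illustration). It shows the
precomposed é as one code point and the decomposed form as a base letter
followed by a combining mark:

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def describe(text):
    """Return (code point, canonical combining class) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# A combining class of 0 marks a base character; the acute accent is non-zero.
print(describe(precomposed))
print(describe(decomposed))

# The two spellings are canonically equivalent: NFC composes, NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

A dead-key sequence, as described above, would hand the application the
first form, not the second.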

> >As for undoing a decomposed character (e.g. e´), I
> think it's safe
> >to undo all characters back to (and including) the
> last non-
> >combining character. For example if you write e´
> (where ´ is not
> >actually ´, but the combining ´) and press undo,
> both characters
> >(which are probably displayed as one glyph) should
> be deleted. (In
> >practice é would/should be written as the
> pre-composed é character,
> >as per Normalization Form C <URL:
> >http://www.unicode.org/unicode/reports/tr15/ >. I
> only use it here
> >as an example.)
>
> Given what you've said so far, here's a possible
> design:
>
> Let the IME do as much composition work as it can.
>
> As far as AbiWord is concerned, the pre-composed
> character is atomic.
> That's what we store, render, select, format, and
> undo.

Agreed. Note that a CJK IME will usually return more
than a single character. For example, entering
"nihongo" with an English keyboard into a Japanese
IME will result in the kanji sequence "ni+hon+go"
which will be a string of 3 characters. This should
be treated atomically by AbiWord.

> So far, so good. That allows us to handle examples
> A.1 and A.2, and happens
> to be more-or-less what we've already got
> implemented.
>
> >> What would a native speaker want to happen when
> you "undo" the
> >> entry of a single "on-screen" character?[1] I
> suspect that
> >> creating such an entity may take more than one
> step (in the
> >> input method editor), but should they always be
> undone
> >> individually?
> >
> >In cases similar to my example above, yes. But not
> always. See for
> >example the romaji input example at <URL:
> >http://www.w3.org/TR/charmod/#sec-CharExamples >.
> How this should
> >be handled is dependent on the actual input method
> used.
>
> Do you have a specific proposal here?
>
> According to example A.3, as far as AbiWord is
> concerned, all those Latin
> characters are never seen. The IME intercepts and
> translates them, handing
> us 3 kanji characters at once.

Correct.

> Thus, the question becomes:
>
> - Are *those* characters atomic (for selection or
> deletion purposes)?
> - Should we glob them for undo purposes?

They are not atomic for selection or deletion but they
are for undo. Imagine it like pasting a string.
Undo should unpaste the string.
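
The "unpaste" idea can be sketched roughly as follows. This is a toy
model in Python, not AbiWord's piece table or its real undo machinery;
the Document class and its method names are hypothetical. Each input
event (a keypress or an IME commit) pushes exactly one undo record,
however many characters it delivered:

```python
class Document:
    """Toy text buffer that keeps one undo record per input event."""

    def __init__(self):
        self.text = ""
        self._undo = []  # length of each inserted string, newest last

    def input_event(self, committed):
        """Append whatever a single keypress or IME commit delivered."""
        self.text += committed
        self._undo.append(len(committed))

    def undo(self):
        """Remove the text added by the most recent input event."""
        if self._undo:
            n = self._undo.pop()
            self.text = self.text[:len(self.text) - n]


doc = Document()
doc.input_event("a")                    # one keypress, one character
doc.input_event("\u65e5\u672c\u8a9e")   # one IME commit: "nihongo" as 3 kanji

print(len(doc.text))   # the kanji are still individual characters
doc.undo()             # a single undo removes all three kanji
print(doc.text)
```

The characters stay individually addressable in the buffer, so selection
and deletion can still work on one kanji at a time.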

> Again, I'm not a native speaker, but I'd guess that
> the answer to the first
> question is yes. The second is less clear to me.
>
> but wait, there's more
> ----------------------
> Now we get closer to the screw cases I was worried
> about.
>
> For example, consider example A.4. Since we're
> letting the IME do the
> necessary composition (or decomposition), we have no
> way to differentiate
> the keystrokes used to create the two lam-alef
> ligatures here. Thus, should
> undo:

Arabic doesn't use an IME. To enter lam-alef you have
two options: hit the lam key, then hit the alef key;
or hit the lam-alef key. For the former, all OSes will
see two keypresses, each representing one character.
For the latter, it may depend on the OS whether there
is one event containing two character codes or two
separate events...

> - glob the "first" one (for a total of 4 steps),
> or
> - decompose the second one (for a total of 6
> steps)?

I'd be surprised if any OS returned a precomposed
glyph, but from memory the old Arabic encodings and
Unicode both support a precomposed lam+alef, so
maybe some OS does. The Arabic versions of Windows I
have played with definitely give two characters that
I can delete individually.

The easiest way to handle all of this is to not do
any kind of normalizing ourselves for now. If the OS
gives us a precomposed ligature, we treat it as a
single character. If the OS gives us two characters
which the renderer may combine into a ligature, we
treat them as separate characters. This is probably
what users are used to on their respective OSes
now.
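
As an illustration of the two cases, here is a small sketch, again
assuming Python's unicodedata module. The stored_units() helper is
hypothetical, and U+FEFB is used as the precomposed lam-alef purely as
an example; whether any OS actually delivers it is not known here:

```python
import unicodedata

two_chars = "\u0644\u0627"   # ARABIC LETTER LAM + ARABIC LETTER ALEF
ligature = "\ufefb"          # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM


def stored_units(os_input):
    """Store exactly what the OS delivered: one unit per code point."""
    return list(os_input)


print(len(stored_units(two_chars)))   # two separately deletable characters
print(len(stored_units(ligature)))    # one character

# Normalizing would change what we store: NFKC folds the presentation
# form into the two letters, while NFC leaves it untouched.
print(unicodedata.normalize("NFKC", ligature) == two_chars)
print(unicodedata.normalize("NFC", ligature) == ligature)
```

With no normalization step, the count of deletable characters simply
follows whatever the OS handed over.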

> ( For anyone tempted to be tricky, yes we could
> theoretically jigger the
> undo records to differentiate them when originally
> typed, but not after the
> file's been stored and reloaded. Both behaviors
> should be consistent, no? )

Whether to normalize our files, or even normalize each
line as the return key is hit, is probably worth
thinking about at some point but probably not
essential yet...

> Even worse, how many undo steps should there be
> after typing the Tamil word
> in example A.5?
>
> - six?
> - five?
> - four?
> - one? (ie, punt and just don't allow
> character-level undo)

There should be one undo for each keypress. Re-render
after each undo.

> Don't ask me, I don't speak or type either language.
> :-)
>
> bottom line
> -----------
> Like Martin, I've been hoping that Unicode was
> monstrous enough that we
> could always expect to encounter fully-composed
> characters in the piece
> table. That way, undo and selections would create a
> user experience that
> would Just Work like our trivial Latin cases.
>
> Evidently, it's not that simple, which is why we
> need answers to these
> questions.

Correct ):

> If my suggestions in the selection case make sense,
> then is there any reason
> to allow undo granularity which is *finer* than the
> selection granularity?

Undo should undo each input event (keypress or IME).
Select should select what looks like a typable unit.
So undo granularity would be finer than selection
granularity in the case of a Vietnamese a with tone
mark. Select would select both. Undo would undo
first the tone mark, then the a.
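
A rough sketch of the two granularities, assuming Python and its
unicodedata module. The selection_units() helper is hypothetical and
only groups marks with a non-zero canonical combining class; it is not
full grapheme-cluster segmentation and would miss, for example, spacing
vowel signs in Indic scripts:

```python
import unicodedata


def selection_units(text):
    """Group each base character with the combining marks that follow it."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch      # attach the mark to the preceding base
        else:
            units.append(ch)     # start a new selectable unit
    return units


# Assumed input: 'a' typed first, then a key producing COMBINING ACUTE ACCENT.
events = ["a", "\u0301"]
text = "".join(events)

print(len(selection_units(text)))   # selection sees one unit
print(len(events))                  # undo sees two input events
```

Undoing the last event here leaves a bare "a", which is still a single
selectable unit.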

> If so, when and how? What kind of user experience
> will Just Work for the
> cases Andrew is most worried about?

We really need to attract a bunch of Asian, Indian,
and Arabic language developers and/or testers!

Andrew Dunbar.

> OK folks, have at it!
>
> Paul

=====
http://linguaphile.sourceforge.net http://www.abisource.com



This archive was generated by hypermail 2.1.4 : Tue Apr 23 2002 - 00:05:46 EDT