Re: commit (HEAD): win32 glyph presence detection

From: Raphael Finkel (raphael@cs.uky.edu)
Date: Tue Jun 24 2003 - 09:27:07 EDT


    Note: This message is in UTF-8, and it uses non-Latin characters.

    0. I think we are confusing ligatures, precomposed characters, and
    compound characters. All are important, but they are different.

    1. A ligature is a way to show several characters with a single glyph.
    The glyph is generally not the same as the overlay of the two
    characters. Examples are œ (oe), ﬁ (fi), and ﬃ (ffi). These ligatures
    are not always appropriate; many languages do not combine oe, for
    instance. Ligatures generally have Unicode code points, and the Unicode
    name typically has the word LIGATURE in it. The author of a text
    typically needs to choose whether or not to use a ligature, although the
    input method can assist the author in this choice. Once this choice is
    made, it should be reflected both in the display (if possible) and in
    the stored text. If the author (or input method) chooses not to ligate,
    then both the display and the stored text should respect this decision,
    although a tool that suggests ligations would be very helpful if the
    input method is deficient.
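
    As a rough illustration of the naming convention above, the following
    Python sketch checks whether a character's Unicode name contains the
    word LIGATURE and prints its compatibility decomposition (NFKD). The
    helper name is_named_ligature is invented for this example, and the
    results depend on the Unicode data shipped with the Python in use.

```python
import unicodedata

def is_named_ligature(ch):
    """True if the character's Unicode name contains the word LIGATURE."""
    return "LIGATURE" in unicodedata.name(ch, "").split()

# U+0153 (oe), U+FB01 (fi), U+FB03 (ffi), and a plain "f" for contrast.
for ch in ["\u0153", "\ufb01", "\ufb03", "f"]:
    name = unicodedata.name(ch, "<unnamed>")
    # NFKD expands characters that have a compatibility decomposition.
    expanded = unicodedata.normalize("NFKD", ch)
    print(f"U+{ord(ch):04X} {name}: "
          f"ligature={is_named_ligature(ch)}, NFKD={expanded!r}")
```

    Note that U+FB01 and U+FB03 expand to their separate letters under
    NFKD, while U+0153 has no decomposition at all, which fits the point
    that oe is not a ligature one can apply or remove mechanically.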

    2. A precomposed character is a way to show a character and a combining
    accent in a single glyph. The glyph is generally identical or very
    close to an overlay of the character and the accent. Examples are Á (A
    with ́), Ѷ (Ѵ with ̏), and אַ (א with ַ). Precomposed characters have
    Unicode code points with names that have the word WITH. The author of a
    text is usually indifferent as to whether the text is displayed with the
    precomposed character, although the quality is often higher when the
    precomposed character is shown. If the author (or the input method)
    chooses not to use the precomposition, then the display should use the
    precomposition (if one exists), because it looks better, but the stored
    text should record the separate characters, because that is what the
    author inserted and what spell checkers will assume. (My uspell
    spelling checker calls wrong precomposition an error, but it
    unprecomposes in order to suggest a good replacement.)

    It is possible to simultaneously ligate and precompose, as in the glyph
    ײַ (י ligated to י with accent ַ: HEBREW LIGATURE YIDDISH YOD YOD PATAH).
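
    The "unprecomposing" step can be sketched with canonical decomposition
    (NFD) from Python's unicodedata module. This is only an illustration,
    not how uspell is implemented; the names unprecompose and same_spelling
    are made up for the example.

```python
import unicodedata

def unprecompose(text):
    """Split precomposed characters into base character plus combining marks."""
    return unicodedata.normalize("NFD", text)

def same_spelling(a, b):
    """Compare two strings while ignoring how their accents are encoded."""
    return unprecompose(a) == unprecompose(b)

# U+00C1, U+0476, U+FB2E and U+FB1F are the examples from the text above.
for ch in ["\u00c1", "\u0476", "\ufb2e", "\ufb1f"]:
    parts = " ".join(f"U+{ord(c):04X}" for c in unprecompose(ch))
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {parts}")
```

    A spelling checker that compares words through something like
    same_spelling would treat a precomposed letter and its separate
    base-plus-accent form as the same word.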

    3. Some characters look like ligatures but are not called ligatures by
    Unicode. Examples are « (<<), ǁ (||), and ʺ (ʹʹ). To give them a name,
    let's call them compound characters. They have Unicode code points that
    don't necessarily indicate that they are compound, although sometimes
    the word DOUBLE appears. Compound characters should be treated like
    ligatures.
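
    A quick check of the three examples, again using Python's unicodedata
    module, shows why these characters cannot be found by name or by
    decomposition: none of the names contains LIGATURE, only some contain
    DOUBLE, and none has a decomposition.

```python
import unicodedata

# U+00AB, U+01C1 and U+02BA are the three compound characters named above.
for ch in ["\u00ab", "\u01c1", "\u02ba"]:
    name = unicodedata.name(ch)
    print(f"U+{ord(ch):04X} {name}: "
          f"LIGATURE={'LIGATURE' in name}, DOUBLE={'DOUBLE' in name}, "
          f"decomposition={unicodedata.decomposition(ch)!r}")
```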

    4. Vim handles these issues as follows. Vim faithfully stores exactly
    what the input method has given it, which can be any Unicode character.
    Vim provides a way to explicitly request a ligature; to get œ, you may
    type <ctrl-K>oe. Vim provides a way to explicitly request Unicode code
    points; to get יִ, you may type <ctrl-V>ufb1d. Vim uses the ligature
    method to allow fast input of some compound characters, but not all.
    Vim does not itself precompose characters for display, but if you are using
    "console" Vim (as opposed to GUI Vim) in an xterm, then the xterm will
    silently apply any precompositions that are available in the font.

    5. I think AbiWord might follow this example. (a) Characters should be
    stored exactly as they are input (or presented by an input method). (b)
    The default input method should include a way to explicitly request
    ligatures and arbitrary Unicode code points. (c) Precomposed characters
    should be used for the display in all cases where the current font
    allows, but the precomposition should never be stored unless it was
    input. (d) If a precomposition is input that is not supported by the
    font, then it should be displayed like any other missing character. I
    personally like the method that Yudit uses for displaying missing
    characters: Yudit shows a box containing the Unicode code point (in 4
    hex characters, 2 in each of 2 rows).
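
    Points (a), (c), and (d) can be sketched in Python as follows. This is
    only a model of the policy, not AbiWord or Yudit code: the function
    names are invented, font_has stands in for whatever glyph-presence
    test the real font layer offers, and hex_box assumes a code point
    that fits in four hex digits.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def hex_box(ch):
    """Missing-character box: two rows of two hex digits each."""
    digits = f"{ord(ch):04X}"
    return (digits[:2], digits[2:])

def display_units(stored, font_has):
    """Return what to draw; the stored text itself is never modified."""
    units = []
    for cluster in clusters(stored):
        composed = unicodedata.normalize("NFC", cluster)
        # (c) Draw the precomposed glyph whenever the font has one.
        if len(composed) == 1 and font_has(composed):
            units.append(composed)
            continue
        # (d) Otherwise draw the characters as stored, boxing missing ones.
        for ch in cluster:
            units.append(ch if font_has(ch) else hex_box(ch))
    return units

full_font = set("A\u00c1").__contains__
print(display_units("A\u0301", full_font))   # separate input, composed display
plain_font = set("A").__contains__
print(display_units("\u00c1", plain_font))   # precomposed input, no glyph
```

    Because display_units only reads the stored string, requirement (a)
    holds by construction: what was input is what stays in the document.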

    6. This solution is not perfect. The ligature ﬁ (fi) should most likely
    never be stored in a document, because spell checkers will not accept it
    as equivalent to fi. But the document will look better (in many fonts)
    if this ligature is applied. Perhaps we need the concept of a
    display-only ligature, selectable by the user, that automatically
    ligates certain combinations for display purposes only. I don't think œ
    (oe) fits into this class, but it might.
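
    One possible shape for such a display-only ligature is sketched below
    in Python. The table DISPLAY_LIGATURES and the function
    ligate_for_display are hypothetical names; a real implementation
    would also have to check that the font contains the ligature glyphs.

```python
import re

# User-selectable table of display-only ligatures (assumed contents).
DISPLAY_LIGATURES = {"ffi": "\ufb03", "fi": "\ufb01"}

# Try longer sequences first so that "ffi" wins over "fi".
_pattern = re.compile(
    "|".join(sorted(map(re.escape, DISPLAY_LIGATURES), key=len, reverse=True))
)

def ligate_for_display(stored, enabled=True):
    """Return the string to draw; the stored text is left untouched."""
    if not enabled:
        return stored
    return _pattern.sub(lambda m: DISPLAY_LIGATURES[m.group(0)], stored)

stored = "office find"
shown = ligate_for_display(stored)
print([f"U+{ord(c):04X}" for c in shown])
```

    The spelling checker keeps working on the stored string, which still
    contains the plain letters, while only the drawn string carries the
    ligatures.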

    Other characters require thought as well. I would call the German ß
    (ss) a compound, and I would call the German ö (oe) a precomposition
    (but not of o and e!). In these cases, I think we should follow my
    recommendations above and require that the input method insert these
    characters if they are desired.

    Raphael Finkel


