From: Raphael Finkel (raphael@cs.uky.edu)
Date: Tue Jun 24 2003 - 09:27:07 EDT
Note: This message is in UTF-8, and it uses non-Latin characters.
0. I think we are confusing ligatures, precomposed characters, and
compound characters. All are important, but they are different.
1. A ligature is a way to show several characters with a single glyph.
The glyph is generally not the same as the overlay of the two
characters. Examples are œ (oe), ﬁ (fi), and ﬃ (ffi). These ligatures
are not always appropriate; many languages do not combine oe, for
instance. Ligatures generally have Unicode code points, and the Unicode
name typically has the word LIGATURE in it. The author of a text
typically needs to choose whether or not to use a ligature, although the
input method can assist the author in this choice. Once this choice is
made, it should be reflected both in the display (if possible) and in
the stored text. If the author (or input method) chooses not to ligate,
then both the display and the stored text should respect this decision,
although a tool that suggests ligations would be very helpful if the
input method is deficient.
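
To make point 1 concrete, here is a minimal Python sketch using the standard unicodedata module. The function names is_named_ligature and unligate are my own, chosen for illustration. It checks for the word LIGATURE in the Unicode name and expands only those ligatures that Unicode itself marks with a compatibility decomposition.

```python
import unicodedata

def is_named_ligature(ch: str) -> bool:
    """True if the character's Unicode name contains the word LIGATURE."""
    return "LIGATURE" in unicodedata.name(ch, "")

def unligate(text: str) -> str:
    """Expand ligatures that carry a <compat> decomposition; keep the rest."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)  # e.g. "<compat> 0066 0069"
        if decomp.startswith("<compat>") and is_named_ligature(ch):
            # Skip the "<compat>" tag and rebuild the separate letters.
            out.append("".join(chr(int(cp, 16)) for cp in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)

print(unicodedata.name("\u0153"))   # the oe ligature
print(unligate("\ufb01n"))          # U+FB01 followed by "n"
```

In this sketch U+0153 (œ) is left alone, because Unicode gives it no decomposition, whereas U+FB01 and U+FB03 are expanded. That matches the observation that not every language treats oe as a mere typographic variant.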
2. A precomposed character is a way to show a character and a combining
accent in a single glyph. The glyph is generally identical or very
close to an overlay of the character and the accent. Examples are Á (A
with ́), Ѷ (Ѵ with ̏), and אַ (א with ַ). Precomposed characters have
Unicode code points whose names contain the word WITH. The author of a
text is usually indifferent as to whether the text is displayed with the
precomposed character, although the quality is often higher when the
precomposed character is shown. If the author (or the input method)
chooses not to use the precomposition, then the display should use the
precomposition (if one exists), because it looks better, but the stored
text should record the separate characters, because that is what the
author inserted and what spell checkers will assume. (My uspell
spelling checker calls wrong precomposition an error, but it
unprecomposes in order to suggest a good replacement.)
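
The split between stored text and displayed text in point 2 can be sketched with Unicode normalization. This is only an illustration: display_form and spelling_form are hypothetical names, and I am assuming that NFC is an acceptable stand-in for "use the precomposition if one exists" and NFD for what a spelling checker does when it unprecomposes. It is not a description of how uspell is implemented.

```python
import unicodedata

def display_form(stored: str) -> str:
    """Precompose for display only (NFC); the stored text is not modified."""
    return unicodedata.normalize("NFC", stored)

def spelling_form(text: str) -> str:
    """Separate base characters and accents (NFD) before dictionary lookup."""
    return unicodedata.normalize("NFD", text)

stored = "A\u0301"                 # A followed by COMBINING ACUTE ACCENT
shown = display_form(stored)       # a single precomposed character
print(len(stored), len(shown), unicodedata.name(shown))
```

One caveat: NFC deliberately does not recompose certain characters, among them the Hebrew presentation form for alef with patah, so a real display layer would need to consult the font for those cases rather than rely on normalization alone.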
It is possible to simultaneously ligate and precompose, as in the glyph
ײַ (י ligated to י with accent ַ: HEBREW LIGATURE YIDDISH YOD YOD PATAH).
3. Some characters look like ligatures but are not called ligatures by
Unicode. Examples are « (<<), ǁ (||), and ʺ (ʹʹ). To give them a name,
let's call them compound characters. They have Unicode code points that
don't necessarily indicate that they are compound, although sometimes
the word DOUBLE appears. Compound characters should be treated like
ligatures.
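
The naming conventions in points 1 through 3 suggest a rough classifier. The sketch below is a heuristic I am assuming for illustration (classify is a made-up name): it looks only at whole words in the Unicode name, so it will label some characters as precomposed that have no decomposition at all, and it cannot recognize compound characters, which is exactly the difficulty described in point 3.

```python
import unicodedata

def classify(ch: str) -> str:
    """Guess a category from the Unicode name alone (heuristic only)."""
    words = unicodedata.name(ch, "").split()
    if "LIGATURE" in words:
        return "ligature"
    if "WITH" in words:
        return "precomposed"
    return "other"   # compound characters end up here, indistinguishable

for ch in ("\ufb01", "\u00c1", "\ufb1f", "\u00ab", "\u01c1", "\u02ba"):
    print(f"U+{ord(ch):04X}", classify(ch), unicodedata.name(ch))
```

Of the three compound examples, only two have DOUBLE in their names, so a tool that wants to treat compounds like ligatures would need an explicit table rather than a name test.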
4. Vim handles these issues as follows. Vim faithfully stores exactly
what the input method has given it, which can be any Unicode character.
Vim provides a way to explicitly request a ligature (Vim calls this a
digraph); to get œ, you may type <ctrl-K>oe. Vim provides a way to
explicitly request Unicode code points; to get יִ (U+FB1D), you may type
<ctrl-V>ufb1d. Vim uses the digraph method to allow fast input of some
compound characters, but not all. Vim itself does not substitute
precomposed glyphs for display, but if you are using "console" Vim (as
opposed to GUI Vim) in an xterm, then the xterm will silently apply any
precompositions that are available in the font.
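
What <ctrl-V>u does with its four hex digits can be mimicked in a few lines of Python. This is a sketch of the idea only, not of Vim's code; from_hex is a hypothetical helper name.

```python
import unicodedata

def from_hex(digits: str) -> str:
    """Turn hex digits such as 'fb1d' into the character at that code point."""
    return chr(int(digits, 16))

ch = from_hex("fb1d")
print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The character obtained this way is stored as a single code point, which is what "faithfully stores exactly what the input method has given it" means in practice.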
5. I think AbiWord might follow this example. (a) Characters should be
stored exactly as they are input (or presented by an input method). (b)
The default input method should include a way to explicitly request
ligatures and arbitrary Unicode code points. (c) Precomposed characters
should be used for the display in all cases where the current font
allows, but the precomposition should never be stored unless it was
input. (d) If a precomposition is input that is not supported by the
font, then it should be displayed like any other missing character. I
personally like the method that Yudit uses for displaying missing
characters: Yudit shows a box containing the Unicode code point (in 4
hex characters, 2 in each of 2 rows).
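
The missing-character box in (d) is easy to prototype. The sketch below is my own approximation of the layout described above, not Yudit's code. The name hex_box is hypothetical, and the handling of code points that need more than four hex digits is an assumption on my part.

```python
def hex_box(ch: str) -> tuple[str, str]:
    """Return the two rows of a box showing the code point of ch in hex."""
    digits = f"{ord(ch):04X}"          # at least four hex digits
    if len(digits) % 2:                # assumption: pad longer code points
        digits = "0" + digits          # so that both rows have equal width
    half = len(digits) // 2
    return digits[:half], digits[half:]

top, bottom = hex_box("\ufb1d")
print(top)
print(bottom)
```

For any character in the Basic Multilingual Plane this yields two rows of two hex digits each.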
6. This solution is not perfect. The ligature ﬁ (fi) should most likely
never be stored in a document, because spell checkers will not accept it
as equivalent to fi. But the document will look better (in many fonts)
if this ligature is applied. Perhaps we need the concept of a
display-only ligature, selectable by the user, that automatically
ligates certain combinations for display purposes only. I don't think œ
(oe) fits into this class, but it might.
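
A display-only ligature could work roughly as follows. This is a sketch under my own assumptions: the table DISPLAY_LIGATURES and the function for_display are invented for illustration, and a real renderer would ask the font which ligatures it has rather than substitute code points. The stored text is never changed, so a spelling checker still sees plain letters.

```python
# Letter sequences that may be ligated for display (user-selectable).
DISPLAY_LIGATURES = {
    "ffi": "\ufb03",   # LATIN SMALL LIGATURE FFI
    "fi": "\ufb01",    # LATIN SMALL LIGATURE FI
}

def for_display(stored: str, enabled: bool = True) -> str:
    """Return a ligated copy of the text for display; stored is untouched."""
    if not enabled:
        return stored
    shown = stored
    # Apply longer sequences first so that "ffi" is not split into f + fi.
    for letters in sorted(DISPLAY_LIGATURES, key=len, reverse=True):
        shown = shown.replace(letters, DISPLAY_LIGATURES[letters])
    return shown

stored = "office find"
shown = for_display(stored)
print(len(stored), len(shown))
```

Leaving œ out of the table reflects the doubt expressed above about whether it belongs in this class.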
Other characters require thought as well. I would call the German ß
(ss) a compound, and I would call the German ö (oe) a precomposition
(but not of o and e!). In these cases, I think we should follow my
recommendations above and require that the input method insert these
characters if they are desired.
Raphael Finkel