smart quotes, et al (was: Re: The 1.0 Jobs List ( :-) )


From: WJCarpenter (bill-abisource@carpenter.ORG)
Date: Fri Jun 16 2000 - 12:40:54 CDT


>> 2) 757/764 These are Word import bugs, having to do with Word's
>> "smart" quotes. This affects lots of documents, and would
>> therefore be nice to fix.

paul> Unicode has smart quote characters, but we may have to do some
paul> font-mapping at rendering time to get them to display properly
paul> on some platforms.

I agree with the latter characterization. After import of an MSWord
smart quoted document, one sees these character codes (I haven't
verified that those are the appropriate Unicode values):

        0x2019 apostrophe
        0x201c open double quotes
        0x201d close double quotes

I think the fonts that AbiWord uses don't have glyphs in those
positions, so GDK just draws a zero-width character there. If you use
cursor keys to navigate, you see that it takes an extra step to get
past where you know there is an invisible smart quote character.

In the AbiWord code that calls gdk_draw_text_wc() to render the glyphs,
there is already a copy of the text to be rendered (to convert from
16-bit to 32-bit). As an experiment, I did a brute-force conversion of
the above three characters into something visible in that copy, and
the results were the Right Thing. To state what is perhaps obvious,
since this is just a render-time substitution, cut and paste preserve
the original values.
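
The experiment above can be sketched roughly as follows. This is only
an illustration in Python, not the actual AbiWord code: the table
FALLBACKS, the function widen_for_render(), and the choice of plain
ASCII stand-ins are all assumptions made for the example.

```python
# Hypothetical fallback table: smart quote code point -> visible ASCII stand-in.
FALLBACKS = {
    0x2019: ord("'"),   # apostrophe
    0x201C: ord('"'),   # open double quotes
    0x201D: ord('"'),   # close double quotes
}


def widen_for_render(text16):
    """Build the copy handed to the drawing call, swapping in fallbacks.

    The input sequence (the document's own text) is never modified, so
    cut and paste would still see the original values.
    """
    return [FALLBACKS.get(unit, unit) for unit in text16]


doc = [ord(c) for c in "\u201cdon\u2019t\u201d"]
shown = widen_for_render(doc)
```

Here `shown` holds only plain quote characters, while `doc` still
holds 0x201c, 0x2019 and 0x201d.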

For me, the problem decomposes into two related problems: First, there
is always the possibility of characters that don't have glyphs in the
current font. Second, there is the algorithmic stuff associated with
the particular case of smart quotes. (Well, OK, three problems: the
interface to ispell doesn't recognize the smart quote values as
punctuation, so any smart-quoted words show as spelling errors.)
Obviously, my experiment only examined the first problem. Some
approaches suggest themselves to me:

1. A generalized "glyph map" thing. Configuration can specify what
mapping is to take place, purely for the sake of screen and/or print
rendering. Variations include (a) map this value to this value in the
current font (easy to do); (b) map this value to this value in some
other font (slightly harder); (c) when you're about to render a
character that has no glyph in the current font, substitute this glyph
instead (slightly harder).
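
To make the three variations concrete, here is a minimal sketch in
Python. Everything in it is hypothetical: the names GlyphMap, Glyph
and resolve(), the idea of passing font coverage in as a dict of
sets, and the precedence order (a) before (b) before (c) are
assumptions for illustration, not an existing AbiWord or GDK
interface.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    font: str   # font to draw with
    code: int   # character value to draw


class GlyphMap:
    """Render-only mapping; the document text itself is never changed."""

    def __init__(self, same_font=None, other_font=None, missing=None):
        self.same_font = same_font or {}     # (a) value -> value, current font
        self.other_font = other_font or {}   # (b) value -> (font, value)
        self.missing = missing               # (c) stand-in when no glyph exists

    def resolve(self, code, font, coverage):
        """coverage maps a font name to the set of values it has glyphs for."""
        if code in self.same_font:
            return Glyph(font, self.same_font[code])
        if code in self.other_font:
            return Glyph(*self.other_font[code])
        if self.missing is not None and code not in coverage.get(font, set()):
            return Glyph(font, self.missing)
        return Glyph(font, code)


# Made-up coverage data: "Times" has printable ASCII, "Symbol" has a bullet.
coverage = {"Times": set(range(0x20, 0x7F)), "Symbol": {0x2022}}
gm = GlyphMap(
    same_font={0x2019: 0x27},
    other_font={0x2022: ("Symbol", 0x2022)},
    missing=0x3F,
)
```

With this configuration, 0x2019 is drawn as a plain apostrophe in the
current font, 0x2022 is drawn from the other font, and any other value
lacking a glyph falls back to a question mark.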

2. A smart quote algorithm. Unless it is overwhelming to humans,
this might as well be user-configurable and more general than just the
smart quotes we know. The elements of the configuration would be (a)
an input character value, (b) a regular expression, and (c) a
substitute value. The algorithm is something like "when any input
character is presented, see if in context it would match the regular
expression; if so, use the substitute value instead". It's slightly
more complicated to implement because the regular expression would
sometimes specify an arbitrary number of characters after the one
currently being typed, so some ex post facto rearrangement as the user
continues typing would be called for. Except for that limited sense,
the mechanism doesn't need to have any memory of substitutions already
completed.
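
A rough sketch of such a rule table, in Python. The rules themselves,
the names RULES and substitute(), and the use of 0x2018 for an opening
single quote are made up for illustration. The sketch also takes a
shortcut: it keeps the raw typed text and recomputes the substitutions
each time, which sidesteps the ex post facto rearrangement rather than
implementing it.

```python
import re

# Each rule: (input character, regular expression, substitute value).
# The expression is matched at the position of the input character;
# lookbehind and lookahead express the surrounding context.
RULES = [
    ("'", re.compile(r"(?<!\S)'(?=\w)"), "\u2018"),  # needs the NEXT character
    ("'", re.compile(r"(?<=\w)'"), "\u2019"),        # apostrophe inside a word
    ('"', re.compile(r'(?<!\S)"'), "\u201c"),        # at start or after a space
    ('"', re.compile(r'"'), "\u201d"),               # any other double quote
]


def substitute(raw):
    """Return raw with the first matching rule applied at each position."""
    out = []
    for i, ch in enumerate(raw):
        for char, pattern, replacement in RULES:
            # match(raw, i) anchors at i but still lets lookbehind see raw[:i].
            if ch == char and pattern.match(raw, i):
                ch = replacement
                break
        out.append(ch)
    return "".join(out)
```

Calling substitute() on a lone typed "'" leaves it unchanged, because
the first rule needs a following character; once the user types a
letter after it, recomputing turns the quote into 0x2018. That is the
"rearrangement as the user continues typing" case in miniature.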

-- 
bill@carpenter.ORG (WJCarpenter)    PGP 0x91865119
38 95 1B 69 C9 C6 3D 25    73 46 32 04 69 D6 ED F3


