RE: dogfood feedback -- Smart Quotes


From: Vlad Harchev (hvv@hippo.ru)
Date: Sat Mar 03 2001 - 03:00:01 CST


On Fri, 2 Mar 2001, Paul Rohr wrote:

 Hi Paul,

> Vlad,
>
> I've stepped through the debugger to isolate the following problem and it
> *is* in your code. The smart quotes are getting removed (on NT) by the
> try_UtoNative() call inside your s_AbiWord_1_Listener::_outputData()
> implementation. To reproduce this, try linking with the version of libiconv
> we ship.

  I think it's libiconv that is broken: when converting, it tries to
substitute Unicode characters that are not available in the target charset
with their visual equivalents (for example, replacing those smart quote
characters with plain double quotes, and en-dashes with "-"). We can fix our
version of libiconv not to do this.
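
 To make the difference concrete, here is a minimal Python sketch of the two
behaviours. It is not libiconv's code: the fallback table and both function
names are made up for illustration, and iso-8859-1 is only an example target
charset.

```python
# Hypothetical table of "visual equivalents"; not taken from libiconv.
VISUAL_FALLBACKS = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
}


def convert_lossy(text, charset):
    """Silently replace unavailable characters with look-alikes."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(VISUAL_FALLBACKS.get(ch, "?"))
    return "".join(out).encode(charset)


def convert_strict(text, charset):
    """Return None when any character is unavailable, so the caller can
    fall back to numeric entities or UTF-8 instead of losing data."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


sample = "\u201cquoted\u201d \u2013 text"
lossy = convert_lossy(sample, "iso-8859-1")
strict = convert_strict(sample, "iso-8859-1")
```

 With the lossy variant the smart quotes come back as plain double quotes and
the caller never finds out; with the strict variant the caller sees the
failure and can choose another representation.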
 
> I'm open to suggestions as to what the correct behavior here should be.
> According to the current stew of ifdefs in that function, there are (at
> least) three different approaches we could choose for non-us-ascii
> characters:
>
> 1. convert them into numeric entities (Jeff's original choice)
> 2. convert to UTF8 (the fallback he didn't choose)
> 3. remap some subset of them to native encodings
>
> Note that since #1 is a strict subset of #2, both can get by without
> declaring an explicit encoding in the XML header.
>
> Could you explain again why you want to change the encoding at export time
> (ie #3)? It feels like a brittle / lossy operation -- in this instance at
> least. I can't speak for other platforms or locales, but on Windows for
> Latin-1 locales (at minimum), I highly doubt that we want *any* such
> remapping at export time.
>

 I think the best solution is to save in UTF-8 or as numeric entities
under Latin-1 and CJK locales (I don't know which to prefer here), and to save
in the native encoding under other locales. Saving in the native encoding is
preferable for non-Latin-1 8-bit locales because .abw files can then be
generated by a script without involving iconv(1), because such files are
easier to edit by hand in a not very advanced editor, and because they are
easier to view with plain viewers. For example, no Russian characters are
present in ISO-8859-1, so a Russian file saved in an encoding other than the
native one will consist of numeric entities or of 2-byte UTF-8 sequences.
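
 As a rough illustration of that last point, this small Python sketch encodes
one Russian word in the three ways discussed. The sample word and the choice
of koi8-r as the native charset are assumptions for the example only.

```python
# "Privet" (hello) written with 6 Cyrillic letters, given as escapes.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"

# 1. numeric entities: every letter becomes something like &#1055;
as_entities = text.encode("ascii", errors="xmlcharrefreplace")

# 2. UTF-8: every Cyrillic letter takes 2 bytes
as_utf8 = text.encode("utf-8")

# 3. native 8-bit encoding: every letter takes 1 byte
as_native = text.encode("koi8-r")

print(len(as_entities), len(as_utf8), len(as_native))  # prints: 42 12 6
```

 Only the third form is readable in a plain viewer or editor set to the
native charset; the first two are unreadable there.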

 So, I think that saving in UTF-8 or as numeric entities (please suggest which
to choose) under Latin-1 and probably CJK locales is the best solution.
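
 The rule I am proposing could be sketched like this in Python. The charset
lists and the function name are hypothetical (the real lists would have to be
agreed on), and UTF-8 stands in for "UTF-8 or numeric entities" until that
choice is made.

```python
# Hypothetical charset groups, for illustration only.
LATIN1_CHARSETS = {"iso-8859-1", "cp1252"}
CJK_CHARSETS = {"big5", "gb2312", "euc-jp", "shift_jis", "euc-kr"}


def choose_export_encoding(native_charset):
    """Pick the encoding to save in, given the locale's native charset."""
    name = native_charset.lower()
    if name in LATIN1_CHARSETS or name in CJK_CHARSETS:
        # No remapping to the native charset at export time.
        return "utf-8"
    # Other 8-bit locales: keep the file in the native encoding.
    return name
```

 For example, choose_export_encoding("ISO-8859-1") gives "utf-8", while
choose_export_encoding("KOI8-R") gives "koi8-r".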

 What do you think?

 Best regards,
  -Vlad


