Re: Two patches, and a file to remove


Subject: Re: Two patches, and a file to remove
From: Karl Ove Hufthammer (huftis@bigfoot.com)
Date: Thu Aug 10 2000 - 14:23:31 CDT


----- Original Message -----
From: "Martin Vermeer" <martin.vermeer@fgi.fi>
To: <abiword-dev@abisource.com>
Cc: "Karl Ove Hufthammer" <huftis@bigfoot.com>
Sent: Thursday, August 10, 2000 5:39 PM
Subject: Re: Two patches, and a file to remove

| On Thu, Aug 10, 2000 at 04:14:29PM +0200, Karl Ove Hufthammer wrote:
|
| > Not so great. If your browser doesn't render U+201E, then that's a problem
| > with your browser (or your fonts).
|
| It appears to be a browser problem. It just prints &bdquo;.

Which browser?

BTW, if we use entity references, it's probably wiser to use numerical instead
of named entity references, e.g. &#8222; instead of &bdquo;. I think--but I'm
not sure--more browsers understand numerical references.
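
To illustrate the numerical form: a minimal sketch, assuming we have the Unicode
code point as an integer. The helper name format_numeric_ref is hypothetical and
is not something that exists in wv or AbiWord.

```c
#include <stdio.h>

/* Hypothetical helper (not part of wv): write the numeric character
 * reference for a Unicode code point into buf, e.g. U+201E -> "&#8222;".
 * Returns the number of characters written, or -1 if buf is too small. */
static int format_numeric_ref(unsigned long code_point, char *buf, size_t buf_size)
{
    int n = snprintf(buf, buf_size, "&#%lu;", code_point);
    if (n < 0 || (size_t)n >= buf_size)
        return -1;
    return n;
}
```

Called with 0x201E, this fills the buffer with the decimal reference for the
low double quotation mark, so no table of entity names is needed.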

| > The most serious problem is that it *loses* information. When the U+201E
| > character is converted to ',,' we have no way of knowing what the original
| > character was. ,, is not the same as U+201E, and browsers won't treat it the
| > same way. A couple of examples:
|
| [...]
|
| I agree with all that. So what is policy? Cater for older browsers? Or for
| newer (better) standards?

Well, if we cater for older browsers, newer browsers will get "bad" output too,
and users will be stuck with old files containing incorrect/ugly characters. If
we do the right thing (according to standards), we'll get much better results in
newer browsers (and other HTML-reading programs), and users will eventually
upgrade their browsers (especially when UTF-8 becomes a widely-used character
encoding for web pages). In a few months' time, most users will use IE 5.5,
Mozilla 1/Navigator 6 or Lynx, and all these browsers can render these Unicode
characters.

| > What I've written here goes for *all* Unicode characters. There's no need
| > to "reduce" them to similar-looking characters.
|
| Unfortunately wv/text.c contains _already_ some examples of that. If we
| follow your policy, that should be fixed too.

I see wv/text.c turns single quotation marks into ` and '. This really looks very
bad (like TeX source!). And I hate seeing all my ellipses being turned into
periods ... :(
But there's a bug in the original too:
   printf("&#133;");
There's no printable character at 133 (U+0085) in Unicode (it's a control
code), so this should be displayed as nothing (or a question mark or a box) in
conforming browsers. Entity references *always* point to characters in the
document character set, which for HTML (and XHTML) is *always* Unicode (well,
not really, it's ISO 10646, which is identical), regardless of the document
character encoding.

Example:

An HTML document marked as using the windows-1253 (Greek) character encoding has
this entity reference: &#217;
Even though there is an Omega character at this position in windows-1253, a
Ù should be rendered (U+00D9), since entity references always point to
characters in the document character set (ISO 10646/Unicode).
But if you write a character with byte value 217 (decimal), Ù, this should be
rendered as an Omega, and not as a Ù (this is really much less confusing than it
sounds :) ).
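
The same distinction as a rough sketch in C. Both function names are made up for
illustration, and the byte decoder only covers ASCII and the uppercase Greek
letters of windows-1253, which is all the example needs.

```c
/* Sketch only: decode a raw windows-1253 byte to a Unicode code point.
 * Handles ASCII and the uppercase Greek letters; returns 0 for anything
 * else (not covered by this sketch). */
static unsigned long cp1253_byte_to_code_point(unsigned char byte)
{
    if (byte < 0x80)
        return byte;                          /* ASCII is unchanged */
    if (byte >= 0xC1 && byte <= 0xD9 && byte != 0xD2)
        return (unsigned long)byte + 0x2D0;   /* 0xC1 -> U+0391 ... 0xD9 -> U+03A9 */
    return 0;
}

/* A numeric reference names a code point in the document character set
 * directly; the document's character encoding plays no part. */
static unsigned long numeric_ref_to_code_point(unsigned long value)
{
    return value;
}
```

With these, byte 217 decodes to U+03A9 (Omega), while the reference &#217;
resolves to U+00D9 (Ù), whatever the encoding.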

So, the ellipsis character should be written as &#8230;, not &#133;.
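
A sketch of how the output side could avoid this kind of mistake. I'm assuming
here that the 133 came from the windows-1252 byte value of the ellipsis (0x85);
that is my guess, not something I've checked in the wv source. The function
names are hypothetical, and the table covers only a handful of the 0x80-0x9F
positions.

```c
#include <stdio.h>

/* Map a windows-1252 byte to its Unicode code point. Only a few of the
 * 0x80-0x9F positions are listed; the rest of that range returns 0
 * (not covered by this sketch). */
static unsigned long cp1252_to_code_point(unsigned char byte)
{
    switch (byte) {
    case 0x84: return 0x201E; /* double low-9 quotation mark */
    case 0x85: return 0x2026; /* horizontal ellipsis */
    case 0x91: return 0x2018; /* left single quotation mark */
    case 0x92: return 0x2019; /* right single quotation mark */
    case 0x93: return 0x201C; /* left double quotation mark */
    case 0x94: return 0x201D; /* right double quotation mark */
    case 0x96: return 0x2013; /* en dash */
    case 0x97: return 0x2014; /* em dash */
    default:
        if (byte >= 0x80 && byte <= 0x9F)
            return 0;         /* not covered here */
        return byte;          /* same value as in ISO 8859-1/Unicode */
    }
}

/* Print the numeric reference for a windows-1252 byte.
 * Returns -1 if the byte is not mapped by the table above. */
static int print_char_ref(FILE *out, unsigned char byte)
{
    unsigned long cp = cp1252_to_code_point(byte);
    if (cp == 0)
        return -1;
    return fprintf(out, "&#%lu;", cp);
}
```

Calling print_char_ref(stdout, 0x85) then writes the reference for U+2026
instead of the raw byte value.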

| So what to do? I am happy either way.

I think we should use the correct characters, but if anybody disagrees, I'm
always willing to listen ... :)

--#
Karl Ove Hufthammer


