Re: XML_Char


Subject: Re: XML_Char
From: Mike Nordell (tamlin@algonet.se)
Date: Sun Dec 17 2000 - 22:12:34 CST


Aaron Lehmann wrote:
> As quoted in the mail I just sent, XML_Char is defined as char. UTF-8
> is used to store multibyte chars. And no, it won't be changed, since
> it's a libxml/expat typedef and not abi's.

Are we then gonna try to redefine C++? A string literal without a leading "L"
is of one, and exactly one, type: "const char[]" (which, to be precise,
converts freely to const char*). This is *not* a UTF-8 string, it's a "char"
string. If we were to allow non-char data we should either start to use
void* (to be able to pass anything: bad (tm)) or use a *real* UTF-8 datatype
(which I think we would have to invent).
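
To illustrate (a sketch; the byte values are just an example):

    const char* p = "hello";            // fine: the literal converts to const char*
    // A literal holding non-ASCII bytes is still just a char array; the
    // compiler has no idea the bytes happen to form UTF-8:
    const char euro[] = "\xE2\x82\xAC"; // three chars, one UTF-8 code point (EURO SIGN)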

> Interesting that you should mention that it's typedeffed to a char,
> since expat and libxml2 disagree upon this. While expat defines it as
> a char, libxml2 defines it as an unsigned char. This causes a
> multitude of warnings:

Expat would allow C/C++ string literals whereas libxml2 wouldn't. Both are
wrong, though.
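
For example (a sketch, using libxml2-style and expat-style typedefs):

    typedef unsigned char xmlChar;   // libxml2
    typedef char XML_Char;           // expat

    const xmlChar* a = "foo";        // error in C++ (warning in C): const char[4]
                                     // doesn't convert to const unsigned char*
    const XML_Char* b = "foo";       // accepted, but the bytes still aren't
                                     // guaranteed to be UTF-8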

> Any ideas?

class UT_XML_Char // please disregard formatting
{
public:
    UT_XML_Char(const char* szNativeString);  // convert a native string to UTF-8
    UT_XML_Char(const UT_XML_Char& rhs);      // copy
    ~UT_XML_Char();
    // ...
    const void* str() const { return m_pUTF8string; }
private:
    void* m_pUTF8string;
};

?

Seriously, there is but one solution to this, and that *is* to encapsulate
it. So what if it doesn't match the interfaces of other ("C") libraries,
that's what the adaptor pattern is for.
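
E.g. something like this (the names are made up, just to show the shape of it):

    // Adaptor: hand the encapsulated UTF-8 buffer to whatever char
    // flavour the C library happens to want.
    void parseWithCLibrary(const UT_XML_Char& s)
    {
        c_library_parse(static_cast<const unsigned char*>(s.str()));
    }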

In C++ we have three distinct byte-sized char types: "char", "signed char"
and "unsigned char". These have been abused in C to hold UTF-8 (and other
stuff too), but no matter how you look at it, it's wrong. A UTF-8 string is
not, and can never be, a C/C++ char string literal, since a literal is by its
very definition a sequence of one-byte chars whereas UTF-8 characters can be
multibyte.
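
Quick demonstration (a sketch):

    const char* s = "\xC3\xA9";  // UTF-8 encoding of LATIN SMALL LETTER E WITH ACUTE
    // strlen(s) == 2: two C++ chars, yet a UTF-8 reader sees one character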

/Mike - please don't cc
