Subject: Re: UCS-2 vs. UCS-4
From: Thomas Fletcher (thomasf@qnx.com)
Date: Tue Jun 26 2001 - 08:29:15 CDT
On Sat, 23 Jun 2001, Martin Sevior wrote:
>
> This is an interesting debate. One extra point we should all keep in mind
> is that we probably don't waste much more space going from 16 => 32 bits
> for character representation.
[Other comments about sizes of data structures snipped]
Martin,
Call me crazy ... but I _totally_ don't believe this statement. For
anyone working on documents of any size, our memory consumption is an
issue. Deciding to double the per character memory requirements will
add up. While some systems are swappable ... we certainly don't want
to count out the fact that Abi could be used on smaller devices.
I'm all for Mike's suggestion of a scalable class that hides all of
this work from me.
Thomas
> The Piece Table consists of a doubly linked-list of Fragments (Frags) of
> various sorts. These Fragments can represent 0 or more characters. Each
> Fragment is a sequence of contiguous text with identical properties.
> Format Marks are represented by Fragments of 0 characters. Struxes (like
> paragraph breaks) are Fragments of 1 character size.
>
> Each Fragment is a class which altogether must consist of at least a
> few hundred bytes. What I'm saying is that there is a considerable
> overhead for each textual character in the PieceTable. There are few
> occasions in AbiWord where the size of the text "in" a Fragment is
> actually larger than the Fragment itself.
>
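
One rough way to put numbers on both positions is to model the footprint
as per-fragment overhead plus raw text. This is only a sketch: the 200-byte
overhead is an assumed stand-in for "a few hundred bytes", and
estimateBytes, fragCount and charCount are hypothetical names, not AbiWord
code.

```cpp
#include <cassert>
#include <cstddef>

// Assumed per-fragment overhead; an illustrative figure, not a measurement.
const std::size_t kFragOverhead = 200;

// Hypothetical model: fragments * overhead + characters * bytes per character.
inline std::size_t estimateBytes(std::size_t fragCount,
                                 std::size_t charCount,
                                 std::size_t bytesPerChar)
{
    // Only fixed-width 1, 2 or 4 byte characters are modelled here.
    assert(bytesPerChar == 1 || bytesPerChar == 2 || bytesPerChar == 4);
    return fragCount * kFragOverhead + charCount * bytesPerChar;
}
```

Under these assumptions, 1,000 fragments of 20 characters each go from
240,000 to 280,000 bytes when moving from 2 to 4 bytes per character, about
17% more. For long runs of unformatted text the character term dominates
and the total approaches double, so the answer depends on how heavily
fragmented real documents are.
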
> This being the case, if we really need to handle > 16 bits I don't think
> we lose much by just making a global change, UCS-2 => UCS-4. It will be
> easier to code and almost certainly be faster.
>
> BTW we currently make LOTS of assumptions of fixed size per character.
>
> Doing a global change UCS-2 => UCS-4 is probably just a single perl
> script and a rewrite of some UT_UCS_* functions to handle 32 bit
> characters.
>
> Hunting through all the abi code to fix fixed size character assumptions
> would be REALLY hard work.
>
> I'm strongly in favour of doing a UCS-2 => UCS-4 global change should
> support above 16 bits be deemed necessary.
>
> Cheers
>
> Martin
>
>
>
> On Sat, 23 Jun 2001, Andrew Dunbar wrote:
>
> > Mike Nordell wrote:
> > >
> > > Please see this post as more-or-less brainstorming.
> > >
> > > It seems that currently all (?) of us don't use anything larger than UCS-2,
> > > but in a not too distant future perhaps we will have to use 2^32 for
> > > character representations (makes me wish for plain ASCII and console-mode
> > > again - I sure as hell don't want to keep track of 4 _billion_ chars).
> > >
> > > I don't know if this is a problem already, but if it is; what about creating
> > > a factory for encoding? Like:
> > >
> > > ASCII_Factory
> > > UTF8_Factory
> > > UCS2_Factory
> > > UCS4_Factory
> > >
> > > and let them return objects that can handle (what to the outside looks like
> > > a linked list of "void*") the chars from a document (or piece table or
> > > whatever, I'm not sure at what level this should be implemented)?
> > >
> > > My idea was something like:
> > > Start at ASCII. If someone enters an outside-ASCII-range char the
> > > document is "upgraded" to the next level that can handle that type of chars.
> > >
> > > When saving, check what max "level" is used, and save using that one.
> > > Example: If someone used 16-bit chars but entered a UCS-4 char, the engine
> > > would "upgrade" the full document [1] to UCS-4. When saving, if those
> > > specific characters were removed, it would "back down" to UCS-2.
> >
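
The "upgrade" idea can be sketched as a buffer that starts at one byte per
character and widens itself when a character no longer fits. Everything
here is hypothetical (ScalableText, append, at, bytesPerChar are
illustrative names, not existing AbiWord classes), and the levels are
assumed to be ASCII, 16-bit and 32-bit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a self-widening text buffer; not AbiWord code.
class ScalableText {
public:
    // Append one code point, widening the storage first if it does not fit.
    void append(std::uint32_t cp) {
        const unsigned needed = cp <= 0x7F ? 1u : (cp <= 0xFFFF ? 2u : 4u);
        if (needed > width_) widen(needed);
        store(bytes_, cp, width_);
    }

    // Fixed width per character keeps random access O(1) at every level.
    std::uint32_t at(std::size_t index) const {
        assert(index < size());
        std::uint32_t cp = 0;
        for (unsigned i = 0; i < width_; ++i)
            cp |= static_cast<std::uint32_t>(bytes_[index * width_ + i]) << (8 * i);
        return cp;
    }

    std::size_t size() const { return bytes_.size() / width_; }
    unsigned bytesPerChar() const { return width_; }

private:
    // Write cp as `width` little-endian bytes.
    static void store(std::vector<std::uint8_t>& out, std::uint32_t cp, unsigned width) {
        for (unsigned i = 0; i < width; ++i)
            out.push_back(static_cast<std::uint8_t>(cp >> (8 * i)));
    }

    // Re-encode every stored character at the new width ("upgrade").
    void widen(unsigned newWidth) {
        std::vector<std::uint8_t> wider;
        wider.reserve(size() * newWidth);
        for (std::size_t c = 0, n = size(); c < n; ++c)
            store(wider, at(c), newWidth);
        bytes_.swap(wider);
        width_ = newWidth;
    }

    std::vector<std::uint8_t> bytes_;
    unsigned width_ = 1;
};
```

The "back down" step on save would be a scan for the largest stored value
followed by the reverse conversion; it is left out of the sketch. Note that
one wide character makes the whole buffer pay the wider size, which is
where a per-fragment rather than per-document level might matter.
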
> > I think this is a good idea. The key is that we have a "string" class
> > about which we do not make assumptions regarding character
> > representation.
> > UT_UCSChar is such an assumption currently.
> >
> > However, UTF-8, UTF-16, and UTF-32 can all handle 32-bit codepoints.
> > I'm still not sure if UCS-2 is defined as different to UTF-16 in this
> > regard. UTF-16 definitely handles surrogates but I'm not sure if
> > they're part of UCS-2 or not. Otherwise the two are the same thing.
> > Correct me if I'm wrong please. Anyway, UTF-8 is always 8 bit
> > based so it is a superset of ASCII - no upgrading needed. It is
> > multibyte meaning for a 31 (not 32) bit range of characters it can
> > take from 1 to 6 bytes. Usually 1 byte for English, 2 bytes for
> > accented characters, and 3 bytes for Chinese/Japanese/Korean, and
> > more for really exotic stuff hardly used yet. UCS-2 can handle
> > up to 2^16 range of characters in sixteen bits. Above that we
> > have "surrogates" which mean we now have to handle possible pairs
> > of sixteen bit values. This is where our UT_UCSChar is not
> > compatible. UTF-32 always uses a single 32 bit value to hold
> > any character whatsoever.
> >
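
To make the sizes concrete: surrogate pairs are a UTF-16 mechanism (UCS-2
proper has none), and they reach code points up to U+10FFFF. The sketch
below shows the byte count for the 1-to-6-byte UTF-8 form described above
and the surrogate arithmetic. The function names are hypothetical helpers
for illustration, not UT_UCS_* functions.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for cp in UTF-8, using the 31-bit, 1-to-6-byte definition.
inline unsigned utf8Length(std::uint32_t cp) {
    assert(cp <= 0x7FFFFFFF);
    if (cp < 0x80)      return 1;   // ASCII
    if (cp < 0x800)     return 2;   // e.g. accented Latin letters
    if (cp < 0x10000)   return 3;   // rest of the 16-bit range, incl. most CJK
    if (cp < 0x200000)  return 4;
    if (cp < 0x4000000) return 5;
    return 6;
}

// Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
inline void toSurrogates(std::uint32_t cp, std::uint16_t& high, std::uint16_t& low) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;              // 20 significant bits
    high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
}

// Recombine a high/low surrogate pair into a single code point.
inline std::uint32_t fromSurrogates(std::uint16_t high, std::uint16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

Code that treats every 16-bit unit as one character would see the two
halves of such a pair as two separate characters, which is the
incompatibility with a 16-bit UT_UCSChar.
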
> > So if we used UTF-8 internally we never need to upgrade and
> > sizes are always pretty good. But we need functions which
> > expect to iterate through the string and never use true
> > random access. We can also do this using UCS-2/UTF-32 if
> > we handle surrogates properly. This also means no true
> > random access. If we really do need random access we might
> > be able to have a UTF-8 -> UCS-2 (no surrogates) -> UTF-32
> > system of upgrades.
> >
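
The loss of random access looks roughly like this with UTF-8: the only way
to reach the n-th character is to decode everything before it. This is a
minimal sketch assuming well-formed input of at most 4 bytes per
character; nextCodePoint and codePointAt are hypothetical names, and a real
decoder would also have to validate the continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode the UTF-8 sequence starting at pos and advance pos past it.
// Assumes well-formed input; no validation is done here.
inline std::uint32_t nextCodePoint(const std::string& s, std::size_t& pos) {
    assert(pos < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80) return lead;                      // single-byte ASCII

    unsigned extra = 1;                                // 2-byte sequence by default
    std::uint32_t cp = lead & 0x1F;
    if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }

    // Each continuation byte contributes its low 6 bits.
    while (extra-- > 0 && pos < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    return cp;
}

// Reaching the n-th character means walking from the start: O(n), not O(1).
inline std::uint32_t codePointAt(const std::string& s, std::size_t n) {
    std::size_t pos = 0;
    std::uint32_t cp = nextCodePoint(s, pos);
    while (n-- > 0) cp = nextCodePoint(s, pos);
    return cp;
}
```

Code written as "give me the next character" loops costs nothing extra
here; code that indexes by character position pays for a scan each time,
which is why the existing fixed-size assumptions matter.
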
> > There are also issues involving the concept of a character
> > versus a codepoint versus a glyph which boil down to the
> > reality that we should always treat a single "character" as
> > a string.
> >
> > Properly designed and coded I don't think this is difficult
> > and should mean we can still have an ASCII only build
> > and a fully multilingual build and keep everyone happy.
> >
> > Andrew Dunbar.
> >
> > --
> > http://linguaphile.sourceforge.net
-------------------------------------------------------------
Thomas (toe-mah) Fletcher QNX Software Systems
thomasf@qnx.com Neutrino Development Group
(613)-591-0931 http://www.qnx.com/~thomasf
This archive was generated by hypermail 2b25 : Tue Jun 26 2001 - 08:41:51 CDT