Re: support for 32-bit Unicode


Subject: Re: support for 32-bit Unicode
From: Martin Sevior (msevior@mccubbin.ph.unimelb.edu.au)
Date: Mon Feb 04 2002 - 07:59:38 CST


On Mon, 4 Feb 2002, Tomas Frydrych wrote:

>
> > What we need to do is support the full 32-bit Unicode
> > character set, but we shouldn't use UTF-32 to do it:
> > that would waste vast amounts of memory, since
> > characters above 16 bits are very rare. We need
> > to switch to UTF-8 internally for everything instead.
> > This is the right answer for several reasons, which
> > have all been covered in depth on several mailing
> > lists.
> Since the characters have a variable bit-width, UTF-8 processing is
> very CPU-intensive for everything but the basic 7-bit ASCII charset. It
> is not meant to be used internally by applications; it is meant as
> an encoding for communication between applications over 8-bit
> channels. Internally we need to use a fixed-width encoding, so if we
> want to support 32-bit Unicode, we have to redefine UT_UCSChar
> to long.
>
> I agree that having a 32-bit UT_UCSChar would waste a lot of memory, and
> I would like to see a case made first for why we need to support 32-bit
> Unicode.
>

Actually it would add at most 1-2% to our document-based memory
requirements. Most of the memory requirements come from the piecetable
frags, fp_run, all the const char strings used to define text
properties, and the fl_BlockLayout classes used to define our document.
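
To make the arithmetic behind that estimate concrete, here is a minimal
sketch. The function name and the byte counts are hypothetical, chosen
only for illustration; they are not measurements taken from AbiWord.

```cpp
#include <cassert>
#include <cstddef>

// Relative growth in total memory when every character widens from
// 16 bits (2 bytes) to 32 bits (4 bytes).
// charCount:      number of characters stored in the document
// structureBytes: everything that is NOT raw character data (frags,
//                 runs, property strings, layout objects, ...).
//                 This is an assumed input, not a measured value.
inline double growthFromWidening(std::size_t charCount,
                                 std::size_t structureBytes) {
    const double before = 2.0 * charCount + structureBytes;
    assert(before > 0.0);  // an empty document has no meaningful ratio
    const double after = 4.0 * charCount + structureBytes;
    return (after - before) / before;
}
```

With the made-up inputs of 1000 characters and 98000 bytes of surrounding
structures, the total grows from 100000 to 102000 bytes, i.e. by 2%. The
growth only approaches 100% when the character data dominates everything
else.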

I really think we should use a 32-bit UT_UCSChar. I VERY STRONGLY oppose
using UTF-8 internally because it is both slow and would fundamentally
break lots of assumptions we make throughout AbiWord.
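
One assumption that fixed-width storage makes cheap is addressing the
n-th character directly. The sketch below illustrates the difference;
UCS4Char and both helper functions are hypothetical names invented for
this example and are not AbiWord code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a widened UT_UCSChar (not the real typedef).
using UCS4Char = std::uint32_t;

// Fixed width: the n-th character is a single array access.
inline UCS4Char charAtFixed(const std::u32string& text, std::size_t n) {
    assert(n < text.size());
    return static_cast<UCS4Char>(text[n]);
}

// UTF-8: characters occupy 1 to 4 bytes, so locating the n-th character
// means scanning from the start and counting character boundaries.
// Returns the byte offset of character n, or std::string::npos if the
// text holds fewer than n + 1 characters. Assumes well-formed UTF-8.
inline std::size_t byteOffsetOfCharUtf8(const std::string& text,
                                        std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(text[i]);
        // Bytes of the form 10xxxxxx continue a character; anything
        // else starts a new one.
        if ((b & 0xC0) != 0x80) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

Any code that treats a character index as an array offset works unchanged
with the fixed-width version, but would need a scan like the one above
(or a cached index) if the buffer held UTF-8.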

I won't accept UTF-8 for internal storage.

Martin


