Subject: Re: Patch: Multi-encoding Text import/export
From: Vlad Harchev (hvv@hippo.ru)
Date: Sat May 19 2001 - 13:39:13 CDT


On Sat, 19 May 2001, Sam TH wrote:

 Hi,

> On Sat, May 19, 2001 at 06:19:21PM +1000, Andrew Dunbar wrote:
> > I consider this a pretty important change.
> >
> > It allows you to import a text file no matter if
> > it's an old 8-bit encoding, UTF-8, or UCS-2 as is
> > used in Windows and Mac OSX.
> >
> > It also allows you to export to any of these text
> > formats - though changes are needed to the rest of
> > AbiWord to fully support this.
> >
> > This also means we will no longer need separate
> > UTF-8 and UCS-2 importers and exporters and any
> > .txt file will "just work" - perfect for church
> > secretaries (:
> >
> > Please somebody have a serious look at this!
> > Feedback much appreciated.
>
> This looks really good. A couple quick comments:
>
> - _recognizeUCS/UTF8 should definitely be members of class.
> IE_Imp_Text_Sniffer is probably the best choice.
>
> - All the new functions need doxygen comments.
>
> Those two you should fix before someone commits this. They shouldn't
> be too hard.
>
> The third thing is that UTF8 can be various-endian as well, so you
> probably want to detect that.

 You are plain wrong here. UTF-8 is defined as a sequence of bytes, so it
can't be endian (and the ability to recognize your offset from the start of a
sequence is the key feature of UTF-8).
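
 To illustrate, here is a minimal sketch in Python (not AbiWord code; the
helper name utf8_sequence_offset is made up for this example). It shows that
the UTF-8 bytes of a character are fixed, while the two UCS-2 byte orders are
mirror images of each other, and that from any byte of a UTF-8 stream you can
tell how far you are from the start of the current sequence.

```python
def utf8_sequence_offset(data: bytes, i: int) -> int:
    """Return how many bytes data[i] lies after the lead byte of its sequence.

    Hypothetical helper for illustration only. It relies on the fact that
    UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    """
    offset = 0
    # Walk backwards while we are still on a continuation byte.
    while i - offset > 0 and (data[i - offset] & 0xC0) == 0x80:
        offset += 1
    return offset


text = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE

# UTF-8 is specified byte by byte, so there is only one possible result.
utf8 = text.encode("utf-8")

# UCS-2 is specified in 16-bit units, so two byte orders exist.
ucs2_le = text.encode("utf-16-le")
ucs2_be = text.encode("utf-16-be")

print(utf8.hex(), ucs2_le.hex(), ucs2_be.hex())

# "a", EURO SIGN (three bytes in UTF-8), "b"
sample = "a\u20acb".encode("utf-8")
print([utf8_sequence_offset(sample, i) for i in range(len(sample))])
```

 The second print lists, for every byte of the sample, its distance from the
lead byte of its own sequence. No byte order is involved anywhere in that
calculation.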
 
> Question: does our current UTF8 export use a byte-order mark? If not,
> it probably should.

 No, a byte order mark is meaningful only for UCS-2 and UCS-4.
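
 As a rough sketch of what the mark is for (Python, using the standard codecs
constants; the function name sniff_byte_order is hypothetical and this is not
the patch's _recognizeUCS code), a reader can look at the first bytes of a
UCS-2 or UCS-4 stream to learn which byte order the writer used:

```python
import codecs


def sniff_byte_order(data: bytes):
    """Return (encoding, byte_order) from a leading byte order mark, or None.

    Illustrative only. UCS-4 is checked first because its little-endian
    mark (FF FE 00 00) begins with the UCS-2 little-endian mark (FF FE).
    Assumption: a UCS-2 LE file whose first character is U+0000 would be
    misread as UCS-4 here; a real sniffer would need more heuristics.
    """
    if data.startswith(codecs.BOM_UTF32_LE):
        return ("ucs-4", "little")
    if data.startswith(codecs.BOM_UTF32_BE):
        return ("ucs-4", "big")
    if data.startswith(codecs.BOM_UTF16_LE):
        return ("ucs-2", "little")
    if data.startswith(codecs.BOM_UTF16_BE):
        return ("ucs-2", "big")
    return None


le_file = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
be_file = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
utf8_file = "hi".encode("utf-8")

print(sniff_byte_order(le_file))
print(sniff_byte_order(be_file))
print(sniff_byte_order(utf8_file))
```

 The plain UTF-8 data carries no such mark and needs none, since its bytes
come out the same on every machine.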

 Best regards,
  -Vlad


