Re: Unicode UCS-2 importer


Subject: Re: Unicode UCS-2 importer
From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Tue May 15 2001 - 09:22:25 CDT


Sam TH wrote:
>
> On Tue, May 15, 2001 at 11:23:17PM +1000, Andrew Dunbar wrote:
> > Tomas Frydrych wrote:
> > >
> > > Hi Andrew,
> > >
> > > I have had a quick look at the importer and overall it looks good,
> > > but I would be much happier if we could do without the goto
> > > construct, something like
> > >
> > > if ((error = _writeHeader(fp)) == UT_OK)
> > > {
> > > error = _parseFile(fp);
> > > }
> > > fclose(fp);
> > > return error;
> >
> > Actually my code is derived from the Text and UTF-8 importers so
> > maybe all three need to be fixed?
> >
> > > Also, if the file does not contain a BOM marker, the exporter
> > > assumes the file is little-endian. Should we not assume the
> > > opposite, since, if I am not mistaken, Unicode is big-endian by
> > > definition?
> >
> > Well I'm not sure what the long term solution would be but
> > nobody has replied to my Unicode text import post yet.
> > The immediate goal is to be able to import Windows Notepad and
> > MS Word Unicode files and they use UCS-2 little-endian.
> >
> > A better solution would require either a parameter to specify
> > endianness or two separate importers and exporters: one for
> > UCS-2 big-endian and one for UCS-2 little-endian. Ugly.
> >
> > I'm actually beginning to think we should just have one Text
> > importer and one Text exporter and since they use iconv
> > already we should just treat ISO, UTF-8, UCS-2 little and
> > big endian as an encoding parameter. That'll cut down on
> > code duplication too. Any ideas?
>
> This is an excellent idea, the fewer importers we have, the better.
> (Actually, my goal would be to have no choices in the open dialog, and
> have AbiWord just do the right thing automatically). However,
> according to unicode.org, there are both UTF-16 big endian and little
> endian encodings, both of which are valid, and which don't have to
> have a BOM. That's in addition to documents w/ a BOM.
>
> Is there still a way we can do everything automatically?

Well, we can't do *everything* automatically, but we can be pretty
intuitive. We can detect UTF-8 90+% of the time. We can detect
UCS-2 (is UTF-16 the same thing?) 99% of the time if it has a BOM.
Without a BOM it's tougher, but if we know that we're supposed to be
loading a Unicode file and we know it's not UTF-8, we can assume the
platform default: little-endian for Windows, big-endian for real OSes.
Windows software *always* writes a BOM, so if we're not on Windows and
there is no BOM we can pretty safely assume big-endian.
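
Roughly, the BOM check with a platform fallback could look like the
sketch below. It is only an illustration of the idea, not code from the
importer: the names guess_ucs2_byte_order and
platform_default_byte_order are made up, and it assumes the caller has
already decided the buffer holds UCS-2 rather than UTF-8.

```c
#include <stddef.h>

/* Byte orders a UCS-2 importer might settle on. */
typedef enum {
    UCS2_LITTLE_ENDIAN,
    UCS2_BIG_ENDIAN
} ucs2_byte_order;

/*
 * Hypothetical helper (not an AbiWord function): the platform default
 * described above, little-endian on Windows and big-endian elsewhere.
 */
static ucs2_byte_order platform_default_byte_order(void)
{
#ifdef _WIN32
    return UCS2_LITTLE_ENDIAN;
#else
    return UCS2_BIG_ENDIAN;
#endif
}

/*
 * Hypothetical helper: choose a byte order for a buffer that is assumed
 * to hold UCS-2 text. U+FEFF serialises as FF FE in little-endian and
 * FE FF in big-endian. If neither pair is present, return the fallback.
 * *had_bom (optional) reports whether a BOM was actually seen, so the
 * caller knows whether to skip the first two bytes.
 */
static ucs2_byte_order
guess_ucs2_byte_order(const unsigned char *buf, size_t len,
                      ucs2_byte_order fallback, int *had_bom)
{
    if (had_bom)
        *had_bom = 0;

    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            if (had_bom)
                *had_bom = 1;
            return UCS2_LITTLE_ENDIAN;
        }
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            if (had_bom)
                *had_bom = 1;
            return UCS2_BIG_ENDIAN;
        }
    }
    return fallback;
}
```

A caller would pass platform_default_byte_order() as the fallback, or a
user-selected byte order when one has been given explicitly.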

We still need manual selection to be possible for those of us who do
lots of file interchanging and do know about different encodings.
MSWord actually handles this very well.

For me there are two issues - how to pass the extra parameter to
the importer/exporter and how to get a list of all the encodings
iconv supports for when we do need the user to select one.
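
On the second issue: as far as I know, POSIX iconv has no call that
enumerates encodings, so one workaround is to keep our own list of
candidate names and probe each one with iconv_open(). The sketch below
assumes that approach. The helper names are made up, and converting to
"UTF-8" is just a stand-in for whatever our internal encoding is.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if this iconv can convert from the
 * named encoding to UTF-8, 0 otherwise. The UTF-8 target is an
 * assumption made for the sketch.
 */
static int iconv_can_decode(const char *encoding)
{
    iconv_t cd = iconv_open("UTF-8", encoding); /* (tocode, fromcode) */

    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}

/*
 * Hypothetical helper: copy into out[] the candidate names this iconv
 * accepts, preserving their order. out[] must have room for n entries.
 * Returns how many names were kept.
 */
static size_t
filter_supported_encodings(const char *const *candidates, size_t n,
                           const char **out)
{
    size_t i;
    size_t kept = 0;

    for (i = 0; i < n; i++) {
        if (iconv_can_decode(candidates[i]))
            out[kept++] = candidates[i];
    }
    return kept;
}
```

The surviving names could fill the encoding list in the dialog and then
be passed to the importer as the encoding parameter. Spellings such as
"UCS-2LE" differ between iconv implementations, so the candidate list
itself would need care.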

Andrew.

-- 
http://linguaphile.sourceforge.net

This archive was generated by hypermail 2b25 : Sat May 26 2001 - 03:51:04 CDT