Re: single or multiple internal encodings

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Tue Apr 23 2002 - 11:57:21 EDT


    --- Tomas Frydrych <tomas@frydrych.uklinux.net> wrote:
    > > GUI: input comes in in whatever format the toolkit defines
    > > (GTK+ - utf8, Win32: ucs2?). iconv() to convert that back to
    > > the backend. For the times when we need strings/data from
    > > the backend, convert it to whatever format the front-end
    > > needs.
    >
    > This seems to be the most logical approach. Most of the GUI
    > strings are static; they need only one translation, at load
    > time. The interaction between the GUI and backend is virtually
    > nil, save keyboard input, but considering human typing speeds,
    > conversion to a different encoding is not a real performance
    > issue.

    Our input strings already go through iconv in the code
    I've played with. The input keycodes can come in any
    encoding and we convert them all to UCS-2.
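That normalization step can be sketched as follows (Python here for brevity, standing in for the C iconv() call; the to_ucs2 helper name is mine):

```python
# Sketch of the input path: bytes arrive in whatever encoding the
# platform delivers and are normalized to UCS-2 (UTF-16-LE here,
# which is identical to UCS-2 for BMP characters).
def to_ucs2(raw: bytes, encoding: str) -> bytes:
    text = raw.decode(encoding)        # e.g. 'latin-1', 'shift_jis', 'utf-8'
    return text.encode('utf-16-le')    # two bytes per BMP character

# A Latin-1 'é' (0xE9) becomes the UCS-2 code unit 0x00E9.
assert to_ucs2(b'\xe9', 'latin-1') == b'\xe9\x00'
```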

    > > Disk/file: Do we want this to be the same encoding as in
    > > the PT? Do we want to store this in a user's native locale?
    > > Do we want to store this in UTF-8 everywhere and be done
    > > with it? There are darn good reasons for all of these
    > > choices. Let's argue out their merits.
    >
    > Currently we are not consistent. The Win32 build uses utf-8,
    > Linux the encoding of the current locale. The advantage of
    > using

    Really?? Why don't we do the same on all platforms?

    > the locale encoding is the size of the file, for unless you
    > use

    I'm sure this could run us into problems with some CJK
    encodings, which might use, for instance, the angle-bracket
    characters as part of a multibyte character. Shift JIS springs
    to mind as being pretty bad.
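A concrete instance (my example; the best-known collision byte in Shift JIS is 0x5C, ASCII backslash, though many other trail bytes also fall in the ASCII range):

```python
# The character 表 (U+8868) encodes in Shift JIS as 0x95 0x5C: its
# trailing byte is ASCII backslash. A naive byte-level scanner that
# looks for ASCII punctuation inside such text will misfire.
encoded = '表'.encode('shift_jis')
assert encoded == b'\x95\x5c'
assert encoded[1:2] == b'\\'    # second byte collides with ASCII '\'
```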

    > only characters from basic ASCII, utf-8 needs at least two
    > bytes for each. The other advantage of using the locale
    > encoding is that the user can view/search, etc. the raw
    > files. This is quite important to a number of users, and I
    > think we should retain this. What,

    True to some extent. There are plenty of UTF-8-aware
    editors/viewers these days: the latest Notepad on Windows, the
    latest default basic text editor on Mac, even Vim 6.0. If it's
    so important, why don't we do it for Windows too? I think it
    should at least be a setting in the user's profile.
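The file-size argument above is easy to check; this sketch uses a Czech sample string of my own, with ISO 8859-2 standing in for the locale encoding:

```python
# Every character of this Czech text fits in the locale encoding
# ISO 8859-2 at one byte each, while UTF-8 spends two bytes on each
# accented letter.
text = "Příliš žluťoučký kůň"
locale_bytes = text.encode('iso8859-2')
utf8_bytes = text.encode('utf-8')
assert len(locale_bytes) == len(text)      # 20 bytes: one per character
assert len(utf8_bytes) == len(text) + 9    # 29 bytes: 9 accented letters
```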

    > however, I would like to see, is for the user to be able to
    > change any default to an encoding of his or her choice.

    For the "Encoded Text" import/export I added an
    encoding field to the document structure so we only
    need to set this and use it appropriately.

    > > Piece Table: Do we use fixed or variable width encodings?
    > > I'd probably be in favor of fixed-width encodings (UCS-2 or
    > > 32). Is processing UTF-8 computationally intensive?
    > > Probably a little, but nothing outrageous. Will picking a
    > > fixed-width encoding just screw us over down the road?
    > > Beats me.
    >
    > I strongly favour fixed width encoding, because it is so much
    > easier to handle. Variable width would save some memory for
    > some users and waste it for others. Picking a fixed-width
    > encoding will not limit us in any way in the future, and nor
    > will variable-width encoding -- they are just encodings,
    > that's all.

    I think we're all agreed on UTF-32 in memory now.
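The practical difference is indexing cost; a quick sketch of why fixed-width wins (the sample string is my own):

```python
# Fixed-width UTF-32 gives O(1) character indexing; UTF-8 requires a
# scan from the start of the buffer to find character i.
s = "naïve text"
utf8 = s.encode('utf-8')        # 11 bytes: 'ï' occupies two
utf32 = s.encode('utf-32-le')   # 40 bytes: four per character, no BOM
assert len(utf8) == 11
assert len(utf32) == 4 * len(s)
# Character i sits at bytes [4*i, 4*i + 4) -- direct addressing.
assert utf32[4*2:4*3].decode('utf-32-le') == 'ï'
```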

    Andrew Dunbar.

    > Tomas

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com




    This archive was generated by hypermail 2.1.4 : Tue Apr 23 2002 - 11:58:36 EDT