Re: commit: abi: UTF8String class

From: Dom Lachowicz (doml@appligent.com)
Date: Mon Apr 22 2002 - 14:30:09 EDT

    On Sun, 2002-04-21 at 22:07, Andrew Dunbar wrote:

    Andrew has asked at least twice for me to reply, so I feel compelled to
    now :)

    I view many separate issues here, as Paul's elephant metaphor tries to
    address. I feel that things should be broken up into much smaller parts
    and then addressed separately. This thread has gotten *far* too long
    already, and is impossible to follow. I've done my best to catch up
    after this weekend barrage.

    IMO, we can look at the whole problem as a combination and interaction
    of the component parts. Many of these parts, in my mind, have very clean
    boundaries. Some, however, may not.

    Several (but surely not all) of these important components are:

    Piece Table
    GUI (dialogs, buttons, menus)
    GUI (layout classes, "canvas" representation of the document)
    Interaction with outside libraries/programs (eg: spell-checking code)
    Disk/File representation of the document (and any limitations of the
    respective XML parsers we use)

    Of course there is a ripple/trickle effect throughout the code, but from
    what I see, these all seem fairly self-contained. We need to list what
    the design goals are for each component, and then map them to each
    other. For instance we might want this (or we might not...):

    -----------------------------------

    GUI: input arrives in whatever format the toolkit defines (GTK+: UTF-8,
    Win32: UCS-2?). Use iconv() to convert it to the backend's encoding. For
    the times when we need strings/data from the backend, convert them to
    whatever format the front-end needs.
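    The iconv() direction could be sketched roughly like this (a hedged
    illustration, not actual AbiWord code; the function name and the choice
    of UCS-2LE as the backend encoding are assumptions for the example):

```cpp
#include <iconv.h>
#include <cstring>
#include <string>
#include <stdexcept>

// Hypothetical sketch: convert toolkit UTF-8 input to UCS-2LE for a
// fixed-width backend, using POSIX iconv().
std::string utf8_to_ucs2le(const std::string& utf8) {
    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Each BMP code point becomes 2 bytes in UCS-2, and each UTF-8 code
    // point is at least 1 byte, so 2x the input size always suffices.
    std::string out(utf8.size() * 2, '\0');
    char*  inp     = const_cast<char*>(utf8.data());
    char*  outp    = &out[0];
    size_t inleft  = utf8.size();
    size_t outleft = out.size();

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - outleft);   // trim to the bytes actually written
    return out;
}
```

    The reverse trip (backend to front-end) is the same call with the
    encoding names swapped.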

    Spell-checking code: aspell/pspell can support various encodings, which
    might be a big win in the long run if we use it. As it stands, we
    already map UCS-2 to something that ispell can understand (through
    hash-encoding files). Defining an XXX->ispell conversion isn't that
    tough.
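    For instance, if the backend were UCS-2 and the ispell hash files were
    built around Latin-1, the conversion is little more than a truncating
    map (a hedged sketch; the function name and the '?' replacement policy
    are assumptions, not what our hash-encoding files actually do):

```cpp
#include <string>
#include <vector>

// Hypothetical XXX->ispell mapping: UCS-2 to Latin-1. ispell works with
// single-byte encodings, so code points above 0xFF have no equivalent
// and are replaced here with '?'.
std::string ucs2_to_latin1(const std::vector<unsigned short>& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned short cp : in)
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
    return out;
}
```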

    Disk/file: Do we want this to be the same encoding as in the PT? Do we
    want to store this in a user's native locale? Do we want to store this
    in UTF-8 everywhere and be done with it? There are darn good reasons for
    all of these choices. Let's argue out their merits.

    Piece Table: Do we use fixed or variable width encodings? I'd probably
    be in favor of fixed-width encodings (UCS-2 or UCS-4). Is processing
    UTF-8 computationally intensive? Probably a little, but nothing
    outrageous. Will picking a fixed-width encoding just screw us over down
    the road? Beats me.
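    To make the cost concrete: with UCS-2/UCS-4 the n-th character is a
    constant-time offset, while UTF-8 has to walk the bytes. A sketch of
    that walk (illustrative only; not proposed piece table code):

```cpp
#include <cstddef>
#include <string>

// Find the byte offset of the n-th code point in a UTF-8 string.
// This is O(n) per lookup; a fixed-width encoding makes it n * width.
std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < s.size()) {
        // Advance one code point: skip continuation bytes (10xxxxxx).
        do { ++i; } while (i < s.size() &&
                           (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80);
        --n;
    }
    return i;
}
```

    It's a small constant factor in practice, but it is why random access
    into the document text argues for a fixed-width piece table encoding.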

    -----------------------------------

    But then there are theoretical arguments against this above proposed
    separation:

    * ) Computational overhead/"slowness"
    * ) Need to keep track of and use other bits of data (encodings for
    various string types, if applicable)
    * ) Increased memory usage
    * ) ....

    -----------------------------------

    Find the boundaries and different components. Find out what works best
    for each of them, and then make sure that they can work well with each
    other. This is either one giant impossible problem, or many smaller
    solvable ones.

    Please split this thread into separate threads and discuss them
    individually. What are the concerns that we see? What problems are we
    trying to solve and how will choosing XYZ (best) solve them for us? What
    new problems does that solution create? We'll see that there's probably
    a best fit somewhere.

    Cheers,
    Dom

    This archive was generated by hypermail 2.1.4 : Mon Apr 22 2002 - 14:35:29 EDT