Re: commit: abi: UTF8String class

From: Dom Lachowicz (doml@appligent.com)
Date: Mon Apr 22 2002 - 14:30:09 EDT

    On Sun, 2002-04-21 at 22:07, Andrew Dunbar wrote:

    Andrew has asked at least twice for me to reply, so I feel compelled to
    now :)

    I view many separate issues here, as Paul's elephant metaphor tries to
    address. I feel that things should be broken up into much smaller parts
    and then addressed separately. This thread has gotten *far* too long
    already, and is impossible to follow. I've done my best to catch up
    after this weekend barrage.

    IMO, we can look at the whole problem as a combination and interaction
    of the component parts. Many of these parts, in my mind, have very clean
    boundaries. Some, however, may not.

    Several (but surely not all) of these important components are:

    Piece Table
    GUI (dialogs, buttons, menus)
    GUI (layout classes, "canvas" representation of the document)
    Interaction with outside libraries/programs (eg: spell-checking code)
    Disk/File representation of the document (and any limitations of the
    respective XML parsers we use)

    Of course there is a ripple/trickle effect throughout the code, but from
    what I see, these all seem fairly self-contained. We need to list what
    the design goals are for each component, and then map them to each
    other. For instance we might want this (or we might not...):

    -----------------------------------

    GUI: input arrives in whatever format the toolkit defines (GTK+: UTF-8,
    Win32: UCS-2?). Use iconv() to convert it to the backend's encoding. For
    the times when we need strings/data from the backend, convert them to
    whatever format the front-end needs.
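    The iconv() direction could be sketched roughly like this (a hedged
    illustration, not actual AbiWord code; the function name and the choice
    of UCS-2LE as the backend encoding are assumptions for the example):

```cpp
#include <iconv.h>
#include <cstring>
#include <string>
#include <stdexcept>

// Hypothetical sketch: convert toolkit UTF-8 input to UCS-2LE for a
// fixed-width backend, using POSIX iconv().
std::string utf8_to_ucs2le(const std::string& utf8) {
    iconv_t cd = iconv_open("UCS-2LE", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Each BMP code point becomes 2 bytes in UCS-2, and each UTF-8 code
    // point is at least 1 byte, so 2x the input size always suffices.
    std::string out(utf8.size() * 2, '\0');
    char*  inp     = const_cast<char*>(utf8.data());
    char*  outp    = &out[0];
    size_t inleft  = utf8.size();
    size_t outleft = out.size();

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - outleft);   // trim to the bytes actually written
    return out;
}
```

    The reverse trip (backend to front-end) is the same call with the
    encoding names swapped.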

    Spell-checking code: aspell/pspell can support various encodings, which
    might be a big win in the long run if we use it. As it stands, we
    already map UCS-2 to something that ispell can understand (through
    hash-encoding files). Defining an XXX->ispell conversion isn't that
    tough.
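    For instance, if the backend were UCS-2 and the ispell hash files were
    built around Latin-1, the conversion is little more than a truncating
    map (a hedged sketch; the function name and the '?' replacement policy
    are assumptions, not what our hash-encoding files actually do):

```cpp
#include <string>
#include <vector>

// Hypothetical XXX->ispell mapping: UCS-2 to Latin-1. ispell works with
// single-byte encodings, so code points above 0xFF have no equivalent
// and are replaced here with '?'.
std::string ucs2_to_latin1(const std::vector<unsigned short>& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned short cp : in)
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
    return out;
}
```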

    Disk/file: Do we want this to be the same encoding as in the PT? Do we
    want to store this in a user's native locale? Do we want to store this
    in UTF-8 everywhere and be done with it? There are darn good reasons for
    all of these choices. Let's argue out their merits.

    Piece Table: Do we use fixed or variable width encodings? I'd probably
    be in favor of fixed-width encodings (UCS-2 or UCS-4). Is processing
    UTF-8 computationally intensive? Probably a little, but nothing
    outrageous. Will picking a fixed-width encoding just screw us over down
    the road? Beats me.
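    To make the cost concrete: with UCS-2/UCS-4 the n-th character is a
    constant-time offset, while UTF-8 has to walk the bytes. A sketch of
    that walk (illustrative only; not proposed piece table code):

```cpp
#include <cstddef>
#include <string>

// Find the byte offset of the n-th code point in a UTF-8 string.
// This is O(n) per lookup; a fixed-width encoding makes it n * width.
std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < s.size()) {
        // Advance one code point: skip continuation bytes (10xxxxxx).
        do { ++i; } while (i < s.size() &&
                           (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80);
        --n;
    }
    return i;
}
```

    It's a small constant factor in practice, but it is why random access
    into the document text argues for a fixed-width piece table encoding.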

    -----------------------------------

    But then there are theoretical arguments against this above proposed
    separation:

    * ) Computational overhead/"slowness"
    * ) Need to keep track of and use other bits of data (encodings for
    various string types, if applicable)
    * ) Increased memory usage
    * ) ....

    -----------------------------------

    Find the boundaries and different components. Find out what works best
    for each of them, and then make sure that they can work well with each
    other. This is either one giant impossible problem, or many smaller
    solvable ones.

    Please split this thread into separate threads and discuss them
    individually. What are the concerns that we see? What problems are we
    trying to solve and how will choosing XYZ (best) solve them for us? What
    new problems does that solution create? We'll see that there's probably
    a best fit somewhere.

    Cheers,
    Dom

    This archive was generated by hypermail 2.1.4 : Mon Apr 22 2002 - 14:35:29 EDT