fields design -- FIELDs vs. CHUNKs (LONG)


Subject: fields design -- FIELDs vs. CHUNKs (LONG)
From: Paul Rohr (paul@abisource.com)
Date: Fri Sep 22 2000 - 19:43:26 CDT


The hardest design issue I faced when thinking through a potential fields
implementation was the content model:

  How complicated can/should the contents of fields get?

As Justin pointed out early on, if we radically restrict the content model
for fields, that makes certain implementation strategies a lot easier. For
example, consider the following restrictions on the contents of a given field:

  (a). just inline text -- no breaks or formatting
  (b). inline text and formatting -- i.e., add the C tag
  (c). same as (b), plus some breaks (line, page, column)
  (d). full generality -- also can have section or paragraph breaks

Likewise, you can complicate that enumeration by also allowing other,
non-textual content, such as:

  - images
  - other fields
  - etc.

So, what should we do?

do what RTF does?
-----------------
According to assumption #V, our mechanism will eventually need to be able to
handle whatever RTF throws at us. And that's where the problem comes in.
As it turns out, the biggest problem is the difference between (c) and (d)
above.

Conceptually, the Word & RTF formats use a delimited stream approach where
documents are a stream of characters which get "broken" by inserting various
inline formatting directives, including section and paragraph breaks.
(Indeed, the UI behaves that way -- just look at the interaction between the
Insert Break dialog and Show Paragraphs mode.)

By contrast, the AbiWord file format is XML-based, so it more directly
represents the "fact" that sections contain paragraphs which contain text
(with optional formatting at each of the three levels).
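To make the difference concrete, here's a minimal Python sketch (not AbiWord
code) which turns a delimited stream into the nested form. The break markers
and the names stream_to_tree and tree_to_xml are made up for illustration;
real RTF uses its own control words, and formatting is ignored entirely.

```python
from xml.sax.saxutils import escape

# Hypothetical stand-ins for the stream format's inline break characters.
PARA_BREAK = "\n"
SECTION_BREAK = "\f"

def stream_to_tree(stream):
    """Convert a flat character stream into sections -> paragraphs -> text."""
    sections = [[""]]                 # one section holding one empty paragraph
    for ch in stream:
        if ch == SECTION_BREAK:
            sections.append([""])     # end the section, start a new one
        elif ch == PARA_BREAK:
            sections[-1].append("")   # end the paragraph, start a new one
        else:
            sections[-1][-1] += ch    # ordinary text joins the open paragraph
    return sections

def tree_to_xml(sections):
    """Serialize the nested form using section and p elements."""
    out = []
    for paragraphs in sections:
        body = "".join("<p>%s</p>" % escape(text) for text in paragraphs)
        out.append("<section>%s</section>" % body)
    return "".join(out)
```

In the stream, a break is just one more character. In the tree, the same
break closes one element and opens another.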

Note that I'm not arguing which approach is better, just highlighting a
difference. (No flamewars, please.) Usually, this difference is
meaningless, but now it's not.

why is this a problem?
----------------------
Fields, conceptually, are containers of auto-generated text. The simple
markup we've been planning to use for fields works just fine, so long as all
the generated content for that field is formatted text. For example:

  <p>
     ...
     <field type=... other=... args=...>
        ... <c> ... </c> ...
     </field>
     ...
  </p>

The type attribute tells what kind of field it is, and the contents can be
updated and replaced entirely by using the information in the other
attributes in a type-specific way. This corresponds quite nicely with the church
secretary's notion of simple fields like page number or the date and time
the document was last printed or saved.
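As a rough sketch of that update cycle, assuming Python's ElementTree as a
stand-in for our own document model: the generator functions, the ctx
dictionary, and the use of the args attribute as a date format are all
hypothetical, chosen only to show the type-specific dispatch.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical generators, one per field type. Each receives the field's
# attributes plus whatever document context it needs.
def gen_page(attrs, ctx):
    return str(ctx["page"])

def gen_date(attrs, ctx):
    # Assumption: args carries a strftime-style format string.
    return ctx["today"].strftime(attrs.get("args", "%Y-%m-%d"))

GENERATORS = {"page": gen_page, "date": gen_date}

def update_field(field, ctx):
    """Discard the field's old contents and regenerate them from its attributes."""
    generate = GENERATORS[field.get("type")]
    for child in list(field):
        field.remove(child)           # old generated content is thrown away
    field.text = generate(field.attrib, ctx)

def update_all(root, ctx):
    """Refresh every field element under root."""
    for field in list(root.iter("field")):
        update_field(field, ctx)

para = ET.fromstring('<p>Page <field type="page"><c>old</c></field> of many</p>')
update_all(para, {"page": 7, "today": datetime.date(2000, 9, 22)})
print(ET.tostring(para, encoding="unicode"))
```

Only the field's own contents are replaced. The surrounding paragraph text
is left alone.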

If you squint hard enough, the RTF mechanism looks quite similar to this.
However, Word/RTF has radically expanded the notion of fields so they can
use this same mechanism to *also* generate large chunks of the document,
which may include paragraph and section breaks. For them, that's easy,
because those breaks are just another character among the contents of that
field.

For us, it's not that easy. If you look at our file format, a paragraph
break looks something like this:

  ... </p><p> ...

Likewise, this is a section break:

  ... </p></section><section><p> ...

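XML's nesting rules explicitly forbid content which looks like this, because
the </p> and </section> tags would close elements that were opened outside
the still-open field: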

  <p>
     ...
     <field>
        ... </p></section><section><p> ...
     </field>
     ...
  </p>
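This is easy to confirm with any XML parser. Here's a small check using
Python's standard xml.etree.ElementTree; the doc wrapper element and the
sample strings are invented for the test and are not our real file format.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A field whose contents stay inside one paragraph: fine.
inline_field = (
    "<doc><section><p>a <field><c>b</c></field> c</p></section></doc>"
)

# A field whose contents include a section break: the </p> arrives while
# the field element is still open, so the tags are mis-nested.
spanning_field = (
    "<doc><section><p>a <field>b</p></section>"
    "<section><p>c</field> d</p></section></doc>"
)

print(is_well_formed(inline_field))
print(is_well_formed(spanning_field))
```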

There are a variety of ways to deal with this issue, but since none of the
simple fields we need now require this level of complexity, let's punt the
issue as follows...

solution: some RTF "fields" are really chunks instead
------------------------------------------------------
For our file format, let's define *two* tags, one for each concept. (Yep,
this breaks my design assumption #I, but I'm convinced it's the right way to
handle it.)

1. Define and implement the FIELD tag for all fields which can use a simple
"inline" content model, such as DATE, TIME, and PAGE. The necessary
mechanism is block-relative, and should be pretty simple. A document is
likely to have lots of FIELDs.

  1a. I expect (a), (b), and (c) are all implementable using this tag.

  1b. I don't know whether we need to allow FIELDs to also contain IMAGEs
  or other FIELDs, but I'm sure we can figure that out as we go along.

2. Use a separate set of CHUNK tags (to be defined later) for more complex
chunks of generated content, such as TOC or INCLUDE. The necessary
mechanism would be document-relative. Most documents should not have very
many CHUNKs (if any).

  2a. This is most useful for category (d).

  2b. I'm sure we'll want to allow CHUNKs to contain FIELDs, and perhaps to
  allow CHUNKs to contain CHUNKs. Again, implementation experience is the
  key here.

Note that both of these can still have a common UI, but under the hood we'll
handle them differently.
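Here's a minimal sketch of that split in Python, assuming an importer which
already knows the content model a given RTF field needs. The ContentModel
enum mirrors categories (a) through (d) above; the per-field assignments in
RTF_FIELD_MODELS are my guesses for illustration, not a settled mapping.

```python
from enum import IntEnum

class ContentModel(IntEnum):
    INLINE_TEXT = 1       # (a) just inline text
    FORMATTED_TEXT = 2    # (b) inline text and formatting
    SOFT_BREAKS = 3       # (c) plus line, page, and column breaks
    BLOCK_BREAKS = 4      # (d) plus section or paragraph breaks

def tag_for(model):
    """Pick the file-format tag: only category (d) needs a CHUNK."""
    return "CHUNK" if model >= ContentModel.BLOCK_BREAKS else "FIELD"

# Hypothetical classification of the field types named in this message.
RTF_FIELD_MODELS = {
    "DATE": ContentModel.INLINE_TEXT,
    "TIME": ContentModel.INLINE_TEXT,
    "PAGE": ContentModel.INLINE_TEXT,
    "TOC": ContentModel.BLOCK_BREAKS,
    "INCLUDE": ContentModel.BLOCK_BREAKS,
}

def tag_for_rtf_field(name):
    """Map an RTF field type name to FIELD or CHUNK."""
    return tag_for(RTF_FIELD_MODELS[name])

for name in sorted(RTF_FIELD_MODELS):
    print(name, "->", tag_for_rtf_field(name))
```

The UI can call both kinds "fields"; only the importer and the file format
need to care which tag comes out.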

Paul


