Fantastic desacription of PieceTable from Joaquin.

From: Martin Sevior <msevior_at_gmail.com>
Date: Sat May 22 2010 - 14:43:08 CEST

Hi everyone,

In the process of Fixing bugs 12721 and 12723 with our new O(LOG(N))
PieceTable, Joaquin wrote up this fantastic email explaining about
PieceTable's. With Joaquin's permission I've also put this text into
our wiki here.

http://www.abisource.com/wiki/AbiWordDevelopment

Cheers

Martin

---------- Forwarded message ----------
From: Joaquin Cuenca Abela <joaquin@cuencaabela.com>
Date: Fri, May 21, 2010 at 6:40 PM
Subject: Re: Bug in the erase method of the Red-Black PieceTable
To: Ersin Akinci <ersin.akinci@gmail.com>
Cc: Martin Sevior <msevior@gmail.com>

On Thu, May 20, 2010 at 9:31 PM, Ersin Akinci <ersin.akinci@gmail.com> wrote:
> Hi Joaquin,
>
> I'm still trying to figure out your red black tree implementation, and a I
> had a couple of questions.  My apologies in advance for my ignorance, I've
> been reading all about binary trees today and I think I have a decent grasp,
> but I'm still working through it.
>
> First, is the implementation you outlined at
> http://e98cuenc.free.fr/wordprocessor/piecetable.html still essentially
> correct, or have there been significant changes?

I see that the two pointers I left there pointing to existing piece
table descriptions are now 404.
Martin can give you more details, I will explain here only the big lines.

Abiword has always used a piece table to store the text (and
formatting, more on that later) of a document. The piece table mission
is store the text of the document, and allow you to insert, delete and
lookup characters from random locations quickly. Another structure
that can be used in a minimalistic editor can be simply an array with
some free space at the end (like a std::vector). You will be able to
lookup very quickly from an array as v[i] is just a direct lookup in
memory, a O(1) operation, but insertion will be very slow as to insert
in the i position you have first to move all the elements j > i one
position to the right. In the worst case, if you insert at i=0, you
will have to move N elements (an O(n) operation). You have the same
complexity for a delete operation.

As this used to be too slow even for moderately sized documents (I
don't know how this will work in modern computers), people started
using more complicated algorithms to store text where they can make
modifications faster. For instance emacs implemented a "gap array",
where they have an array exactly like the one I described, except that
the free space (the "gap") was not at the end of the array, but
somewhere in the middle (to be more exact, it was at the cursor
position). This way to insert a character you add it in the free space
and you shrink the gap 1 character. To delete a character you expand
the gap 1 character, and to move the cursor position to the left or to
the right you just copy 1 character from the left side of the gap to
the right side (effectively moving the gap to the left or to the
right). As you can imagine, that's not very efficient if you want to
make the cursor jump quickly from the beginning of the document to the
end, and you get a bit blocked when the gap is exhausted, as emacs has
to reallocate the buffer and duplicate it with a new gap (you may have
experienced this when editing huge documents in emacs).

I think this strategy is not appropriate for editors with scroll bars,
where people are used to jump randomly in the document way more than a
classic emacs user, but it's simple and it works well enough.

Abiword (as MS Word) uses a "piece table", it's actually fairly
simple. You have 2 arrays, one of them is immutable and contains the
original string of characters to edit (the "initial array"), the other
one (the "change array") is an empty array with some free space at the
end (like a std::vector<char>). We will only append characters to the
change array.

You also need a list (can be a linked list, more on that later) of
"Node"s, each one of them points to a part of the text, they will
contain:

- the array that they are pointing to ("initial" or "change")
- the offset of the text inside this array
- the size of the text that they point to

For instance, if your arrays are:

Initial: [This is an example]
Change: []

and you have a Node with:

buffer: Initial
offset: 5
size: 2

This node points to the word "is". You start con a list of Nodes of 1
element, the first and only Node will be:

buffer: Initial
offset: 0
size: size of Initial (18 in our example)
(from now on, I will represent this like [Initial, 0, 18])

This obviously points to the entire original text. When the user
inserts a new character, it gets appended to the change buffer, and
Nodes are created / splitted to represent the new text. For instance,
let's say that you want to insert the character "X" at the end of
"This" in the above example, you will end up with the following values
in the buffers and in the list of nodes:

Initial: [This is an example]
Change: [X]
[Initial, 0, 4], [Change, 0, 1], [Initial, 4, 14]

To delete a character you don't delete anything in the buffers, you
just change a Node from the list of Nodes (splitting it, removing it
or shrinking it, depending of the position of the character).

Now, insertion and deletion don't depend anymore on the size of the
document, but on the number of modifications that the user has
performed in the document. Lookup is also slower, as now it has to
find the right piece for a particular offset, again an operation that
depends on the number of modifications (O(M)).

That fine for MS Word, but at least in Abiword the pieces have been
used not only to store the modifications, but also styles. Ie, a node
needs to have a uniform style (font, size, etc.). So an empty document
doesn't start with a single Node, but with as many nodes as different
spans of text you have in the document. The number of Nodes is thus
quite big, and this O(M) complexity in lookup, insert and delete is in
practice quite expensive.

Martin did a few years ago a quick hack to improve lookup, after every
mutation to the piece table, he walked the entire list of Nodes (or
"pieces") and created a vector that stored these Nodes. This way you
could visit them faster in the future (still a O(M) operation, but
it's faster to lookup contiguous pieces of memory than non
contiguous). Insert and deletion became even slower, but lookup became
faster.

The data structure that I devised was a way to store this "list of
Nodes" in such a way that all the operations are worst case O(log(M)).
From here, you can read
http://e98cuenc.free.fr/wordprocessor/piecetable.html and it should
hopefully make sense.

> Second, from what I read online, leaf nodes in red black trees seem to be
> null pointers, sentinel nodes, or some other empty structure.  From this
> discussion, however, it seems like Abiword's leaf nodes are actually
> fragments.  Is this true?

In my piece table, the leaf nodes are sentinel nodes. Ie, ALL the
pointers to a leaf node, are pointing to the same node, m_pLeaf.
A pointer to m_pLeaf means "there is nothing here". You can think of
it as NULL. I choose to point to a shared object instead of just
putting NULL because it simplifies the algorithms, for instance you
can assume that you always have a Node struct when you get the left or
right child of a Node. Sometimes we write in the parent of m_pLeaf,
but we never read from it.

Abiword's leaf nodes are also m_pLeaf, IIRC. Abiword's nodes have a
pointer to a pf_Fragment (in my piece table, it will be my small
pt_Piece struct (in pt_Piece.hpp).

> Third, I'm a little confused as to what the key for each fragment is.  In
> your explanation, you switch from a key based on piece position and piece
> size (the former being unique)

In the original implementation the only key (implicit) was the
position of the Node in the double linked list with its size.

> to a key based on piece size and the summed
> size of all the pieces in the left and right child branches of the parent
> node.

No, it's the summed size of all the pieces in the left branches (not
in the left and right branches).

>  How does that result in a unique key and what exactly _is_ the key?
> Is it the "offset" that you mention in your explanation?  Where is the
> offset being calculated from (e.g., offset from the root node, offset of the
> fragment in the piece table, offset of the fragment in the document in terms
> of absolute characters from the beginning of the document)?  If the key is
> the offset, how are size, size_left, and size_right used to come up with it?

Please let me know if this is still confusing after the explanation I
wrote above.

Cheers,

> Fourth, I was just wondering what pt_PieceTree.cpp is about and how it will
> modify/replace/work with pt_Fragment.cpp, etc...
>
> Thanks,
> Ersin
>
> On Thu, May 20, 2010 at 11:30 AM, Joaquin Cuenca Abela
> <joaquin@cuencaabela.com> wrote:
>>
>> On Thu, May 20, 2010 at 4:14 PM, Joaquin Cuenca Abela
>> <joaquin@cuencaabela.com> wrote:
>> > Arghh, sorry, I'll do it as soon as I get back home.
>> >
>> > On May 20, 2010 3:19 PM, "Martin Sevior" <msevior@gmail.com> wrote:
>> >
>> > HI Joaquin,
>> >
>> > Thanks very much for your careful review of the code. However the
>> > erase method is in pt_PieceTree.cpp not pt_PieceTable.cpp. Could you
>> > send us the updated pt_PieceTree.cpp?
>> >
>> > Thanks!
>> >
>> > Martin
>> >
>> >
>> > On Thu, May 20, 2010 at 9:07 PM, Joaquin Cuenca Abela
>> >
>> > <e98cuenc@gmail.com> wrote:
>> >> Hi guys,
>> >>
>> >> Martin, I think you're right and the erase method is inde...
>>
>>
>>
>> --
>> Joaquin Cuenca Abela
>
>
>
> --
> What Digital Revolution? -- www.whatdigitalrevolution.com
> Thinking critically about digital worlds.
>

--
Joaquin Cuenca Abela
Received on Sat, 22 May 2010 22:43:08 +1000

This archive was generated by hypermail 2.1.8 : Sat May 22 2010 - 14:43:19 CEST