
Data structures - Trees

Article Index
Data structures - Trees
Perfectly balanced tree
AVL and B Trees
Glossary
Page 1 of 4

Classic data structures produce classic tutorials. In this edition of Babbage's Bag we investigate
the advanced ecology of trees - perfectly balanced trees, AVL trees and B-Trees.

Trees and indexes


The tree is one of the most powerful of the advanced data structures and it often pops up in even
more advanced subjects such as AI and compiler design. Surprisingly, though, the tree is also
important in a much more basic application - namely the keeping of an efficient index.

Whenever you use a database there is a 99% chance that an index is involved somewhere. The
simplest type of index is a sorted listing of the key field. This provides a fast lookup because you
can use a binary search to locate any item without having to look at each one in turn.
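
As a rough illustration of the sorted-list index just described, here is a small Python sketch using the standard bisect module. The index contents and the lookup helper are made up for the example.

```python
from bisect import bisect_left

# A hypothetical index: (key, record position) pairs kept sorted by key.
index = [(3, 40), (8, 10), (15, 30), (23, 0), (42, 20)]
keys = [key for key, _ in index]

def lookup(search_key):
    """Binary search the sorted keys; return the record position or None."""
    i = bisect_left(keys, search_key)
    if i < len(keys) and keys[i] == search_key:
        return index[i][1]
    return None

print(lookup(15))  # 30
print(lookup(16))  # None
```

Each probe halves the range still to be searched, which is why no item-by-item scan is needed.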

The trouble with a simple ordered list only becomes apparent once you start adding new items
and have to keep the list sorted - it can be done reasonably efficiently but it takes some advanced
juggling. A more important defect in these days of networking and multi-user systems is related
to the file locking properties of such an index. Basically if you want to share a linear index and
allow more than one user to update it then you have to lock the entire index during each update.
In other words a linear index isn't easy to share and this is where trees come in - I suppose you
could say that trees are shareable.
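
To make the cost of keeping a linear index sorted a little more concrete, here is a minimal sketch, again assuming the index is an ordinary Python list. The add_key helper is hypothetical and simply reports how many existing keys had to move.

```python
from bisect import bisect_left

def add_key(sorted_keys, new_key):
    """Insert new_key in place, keeping the list sorted.

    Returns the number of existing keys that had to move up one slot.
    """
    position = bisect_left(sorted_keys, new_key)  # fast: a binary search
    moved = len(sorted_keys) - position
    sorted_keys.insert(position, new_key)         # slow part: shifts the tail
    return moved

keys = [3, 8, 15, 23, 42]
print(add_key(keys, 5))  # 4
print(keys)              # [3, 5, 8, 15, 23, 42]
```

Finding the slot is cheap, but a single insert can touch most of the list, which is also why the whole index tends to need locking during an update.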

Tree ecology
A tree is a data structure consisting of nodes organised as a hierarchy - see Figure 1.
Figure 1: Some tree jargon

There is some obvious jargon that relates to trees and some that is not so obvious. Both kinds are
summarised in the glossary and selected examples are shown in Figure 1.

I will try to avoid overly academic definitions or descriptions in what follows but if you need a
quick definition of any term then look it up in the glossary.

Binary trees
A worthwhile simplification is to consider only binary trees. A binary tree is one in which each
node has at most two descendants - a node can have just one but it can't have more than two.

Clearly each node in a binary tree can have a left and/or a right descendant. The importance of a
binary tree is that it can create a data structure that mimics a "yes/no" decision making process.

For example, if you construct a binary tree to store numeric values such that each left sub-tree
contains larger values and each right sub-tree contains smaller values then it is easy to search the
tree for any particular value. The algorithm is simply a tree search equivalent of a binary search:

start at the root
REPEAT until there is no node to move to
    IF value at node = search value
        THEN found - stop
    IF value at node < search value
        THEN move to left descendant
        ELSE move to right descendant
END REPEAT

Of course if the loop terminates because it runs off the bottom of the tree, i.e. a terminal node
has been checked and there is no descendant to move to, then the search value isn't in the tree,
but the fine detail only obscures the basic principles.
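
The search can be sketched in Python as follows. The Node class and the sample tree are assumptions made for the illustration, and the code follows the ordering used in the text above (larger values in the left sub-tree, smaller values in the right), which is the reverse of what many textbooks use.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left    # sub-tree of larger values (the convention used above)
        self.right = right  # sub-tree of smaller values

def search(root, target):
    """Return the node holding target, or None if it isn't in the tree."""
    node = root
    while node is not None:
        if node.value == target:
            return node
        if node.value < target:
            node = node.left    # target is larger, so look in the left sub-tree
        else:
            node = node.right   # target is smaller, so look in the right sub-tree
    return None

# A small made-up tree that obeys the ordering rule.
root = Node(8,
            left=Node(12, left=Node(14), right=Node(10)),
            right=Node(4, left=Node(6), right=Node(2)))

print(search(root, 10) is not None)  # True
print(search(root, 5) is not None)   # False
```

Running off the bottom of the tree (node becomes None) is the "not found" case.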

The next question is how the shape of the tree affects the efficiency of the search. We all have a
tendency to imagine complete binary trees like the one in Figure 2a and in this case it isn't
difficult to see that in the worst case a search would have to go down to the full depth of the
tree. If you are happy with maths you will know that if the tree in Figure 2a contains n items
then its depth is log2(n+1), or roughly log2 n, and so at best a tree search is as fast as a
binary search.

Figure 2a: The "perfect" binary tree

The worst possible performance is produced by a tree like that in Figure 2b. In this case all of
the items are lined up on a single branch making a tree with a depth of n. The worst case search
of such a tree would take n comparisons, which is the same as searching an unsorted linear list.

So depending on the shape of the tree search efficiency varies from a binary search of a sorted
list to a linear search of an unsorted list. Clearly if it is going to be worth using a tree we have to
ensure that it is going to be closer in shape to the tree in Figure 2a than that in 2b.
Figure 2b: This may be an extreme binary tree but it still IS a binary tree
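
The two extremes are easy to reproduce. The sketch below is only an illustration: it assumes a plain insert with no rebalancing, reuses the larger-on-the-left ordering from earlier, and the helper names are invented.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None   # larger values
        self.right = None  # smaller (or equal) values

def insert(root, value):
    """Plain insertion with no rebalancing; returns the root."""
    if root is None:
        return Node(value)
    node = root
    while True:
        if value > node.value:
            if node.left is None:
                node.left = Node(value)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(value)
                return root
            node = node.right

def build(values):
    root = None
    for value in values:
        root = insert(root, value)
    return root

def depth(node):
    """Number of levels, counting the root as level one."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

in_sorted_order = build(range(1, 16))
level_by_level = build([8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15])
print(depth(in_sorted_order))  # 15
print(depth(level_by_level))   # 4
```

The same 15 values give a single long branch when they arrive already sorted and a complete tree when they arrive level by level.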

All a question of balance


You might at first think that the solution is always to order the nodes so that the search tree is
a perfect example of the complete tree in Figure 2a.

The first problem is that not all trees have enough nodes to be complete. For example, a tree
with a single node is complete but one with two nodes isn't and so on. It doesn't take a genius to
work out that complete trees always have one less than a power of two nodes. With other
numbers of nodes the best we can do is to ask that a tree's terminal nodes are as nearly as
possible on the same level.
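
The node count can be checked with a short sum, taking the root as level one as in the glossary. The symbols d (depth) and n (number of nodes) are introduced here just for the illustration.

```latex
n = \sum_{k=1}^{d} 2^{k-1} = 2^{d} - 1
\qquad\Longleftrightarrow\qquad
d = \log_2 (n + 1)
```

So complete trees of depth 1, 2, 3 and 4 have 1, 3, 7 and 15 nodes respectively.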

You can think of this as trying to produce a tree with 'branches' of as nearly the same length as
possible. In practice it turns out to be possible always to arrange a tree so that the numbers
of nodes in each node's right and left sub-trees differ by at most one, see Figure 3.
Figure 3: A balanced tree

Such trees are called perfectly balanced trees because they are as in balance as it is possible to
be for that number of nodes. If you have been following the argument it should be obvious that
the search time is at a minimum for perfectly balanced trees.
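
The balance condition translates almost directly into code. This is a minimal sketch under the assumption of a simple Node class. It recounts sub-trees at every node, so it is meant for clarity rather than speed.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def count(node):
    """Total number of nodes in the sub-tree rooted at node."""
    if node is None:
        return 0
    return 1 + count(node.left) + count(node.right)

def is_perfectly_balanced(node):
    """True if, at every node, the two sub-tree node counts differ by at most one."""
    if node is None:
        return True
    if abs(count(node.left) - count(node.right)) > 1:
        return False
    return is_perfectly_balanced(node.left) and is_perfectly_balanced(node.right)

balanced = Node(5, Node(7, Node(8)), Node(3))
lopsided = Node(5, Node(7, Node(8, Node(9))))
print(is_perfectly_balanced(balanced))  # True
print(is_perfectly_balanced(lopsided))  # False
```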

At this point it looks as though all the problems are solved. All we have to do is make sure that
the tree is perfectly balanced and everything will be as efficient as it can be. Well this is true but
it misses the point that ensuring that a tree is perfectly balanced isn't easy. If you have all of the
data before you begin creating the tree then it is easy to construct a perfectly balanced tree but it
is equally obvious that this task is equivalent to sorting the data and so we might as well just use
a sorted list and binary search approach.
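
For what it's worth, building a perfectly balanced tree from data that is already sorted takes only a few lines. The sketch below is one possible way of doing it, not a definitive one: it puts the middle value at the root and recurses on the two halves, keeping larger values on the left as before. The function names are assumptions.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None   # larger values
        self.right = None  # smaller values

def build_balanced(sorted_values, lo=0, hi=None):
    """Build a perfectly balanced tree from sorted_values[lo:hi] (ascending order)."""
    if hi is None:
        hi = len(sorted_values)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    node = Node(sorted_values[mid])
    node.right = build_balanced(sorted_values, lo, mid)      # the smaller half
    node.left = build_balanced(sorted_values, mid + 1, hi)   # the larger half
    return node

def depth(node):
    """Number of levels, counting the root as level one."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

root = build_balanced(list(range(1, 11)))
print(root.value)   # 6
print(depth(root))  # 4
```

Notice that the input has to be sorted first, which is exactly the objection raised above.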

The only time that a tree search is to be preferred is if the tree is built as data arrives because
there is the possibility of building a well shaped search tree without sorting.

For example, if you already have the perfectly balanced tree in Figure 4a and the value 2 has to
be added to it then the result is the perfectly balanced tree in Figure 4b.
Figure 4a: A perfectly balanced tree

Figure 4b: Adding (2) keeps the tree in balance.

However it isn't always possible to insert a new data value and keep the tree in perfect balance.
For example, there is no way to add 9 to the tree in Figure 4a and keep it in perfect balance
(Figure 4c) without reordering chunks of the tree. It turns out the effort expended in reorganising
the tree to maintain perfect balance just isn't worth it.

Figure 4c: Adding (9) makes the tree unbalanced

AVL Trees
So it looks as though using trees to store data such that searching is efficient is problematic. Well there might be

One such approach is to insist that the depths of each sub-tree differ by at most one. A tree that conforms to this d

Many programmers have puzzled over what AVL might stand for - Averagely Very Long tree?

The answer is that AVL trees were invented by Adelson-Velskii and Landis in 1962. Notice that every perfectly b
Figure 5: An AVL tree that isn't perfectly balanced
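
The difference between the two conditions can be seen in a few lines of Python. This is a sketch under assumptions: the Node class, the helper names and the sample tree are all invented for the illustration, and both checks recompute their measurements at every node for the sake of clarity.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def depth(node):
    return 0 if node is None else 1 + max(depth(node.left), depth(node.right))

def count(node):
    return 0 if node is None else 1 + count(node.left) + count(node.right)

def is_avl(node):
    """AVL condition: sub-tree depths differ by at most one at every node."""
    if node is None:
        return True
    if abs(depth(node.left) - depth(node.right)) > 1:
        return False
    return is_avl(node.left) and is_avl(node.right)

def is_perfectly_balanced(node):
    """Stricter condition: sub-tree node counts differ by at most one at every node."""
    if node is None:
        return True
    if abs(count(node.left) - count(node.right)) > 1:
        return False
    return is_perfectly_balanced(node.left) and is_perfectly_balanced(node.right)

# Three nodes on one side of the root, one on the other.
tree = Node(8, Node(12, Node(14), Node(10)), Node(4))
print(is_avl(tree))                 # True
print(is_perfectly_balanced(tree))  # False
```

The depths at the root are 2 and 1, which satisfies the AVL rule, while the node counts are 3 and 1, which breaks perfect balance.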

It turns out that an AVL tree will never be more than 45% deeper than the equivalent perfectly balanced tree. Thi

In short, re-balancing an AVL tree is easy, as can be seen in Figure 6a & b.

Figure 6a: A simple reorganisation converts a not quite AVL tree into an AVL tree
Figure 6b: A slightly more complicated example of a reorganisation converting a not quite AVL tree into an AVL
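
The figures aren't reproduced here, so the following is only a guess at the kind of reorganisation involved: the standard single rotation, applied to a three-node chain. The Node class and the function name are assumptions.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def depth(node):
    return 0 if node is None else 1 + max(depth(node.left), depth(node.right))

def rotate_right(node):
    """Promote node.left to the top of this sub-tree and return the new top.

    Only three links change and the left-to-right order of the values
    is preserved, so the result is still a valid search tree.
    """
    pivot = node.left
    node.left = pivot.right
    pivot.right = node
    return pivot

# A chain hanging off the left: depths 2 and 0 at the root, so not an AVL tree.
chain = Node(1, left=Node(2, left=Node(3)))
print(depth(chain))     # 3

balanced = rotate_right(chain)
print(balanced.value)   # 2
print(depth(balanced))  # 2
```

A zig-zag shaped chain needs two such rotations, one on the child and then one on the parent, which may be what the slightly more complicated case shows.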

B-Trees
At this point the story could come to an end and we could all happily use AVL trees to store data that needs to be

One of the nicest ideas to get around this problem is the B-Tree. A B-Tree is constructed using a `page' of storage

The organisation of each page, and its links to the next page, is more complicated than for a binary tree but not th

The m items on each page are entered in order and either a page is terminal or it has m+1 pointers to pages at the

A B-Tree of order n satisfies the following:

1. Every page contains at most 2n items

2. Every page, except the root page, contains at least n items

3. Every page with m items is either a leaf page or has m+1 descendants

4. All leaf pages are at the same level
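
The four rules can be turned into a small checker. This is a sketch under assumptions: the Page class is invented for the illustration, and the checker tests only the four structural rules, not the ordering of the items.

```python
class Page:
    def __init__(self, items, children=None):
        self.items = items              # the keys stored on this page
        self.children = children or []  # empty for a leaf page

def is_btree(root, n):
    """Check the four rules for a B-Tree of order n."""
    leaf_levels = set()

    def check(page, level, is_root):
        m = len(page.items)
        if m > 2 * n:                        # rule 1: at most 2n items
            return False
        if not is_root and m < n:            # rule 2: at least n items
            return False
        if not page.children:                # a leaf page
            leaf_levels.add(level)
            return True
        if len(page.children) != m + 1:      # rule 3: m+1 descendants
            return False
        return all(check(child, level + 1, False) for child in page.children)

    # rule 4: every leaf page was found at the same level
    return check(root, 1, True) and len(leaf_levels) == 1

good = Page([20], [Page([5, 10]), Page([25, 30, 35])])
bad = Page([20], [Page([5]), Page([25, 30])])
print(is_btree(good, 2))  # True
print(is_btree(bad, 2))   # False
```

The second tree fails because one of its leaf pages holds a single item, fewer than the n = 2 required of any page other than the root.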

You can spend a few happy hours working out how this rather abstract definition results in the friendly B-Tree in
Figure 7: A B-Tree

You should also find it easy to work out the algorithm for searching a B-Tree. Basically it comes down to starting
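
A minimal sketch of such a search, assuming an invented Page class in which the items are held in ascending order and children[i] leads to the keys that fall before items[i] (with the last child holding everything after the last item):

```python
from bisect import bisect_left

class Page:
    def __init__(self, items, children=None):
        self.items = items              # keys in ascending order
        self.children = children or []  # empty for a leaf page

def btree_search(page, key):
    """Return True if key is stored on this page or on any page below it."""
    while True:
        i = bisect_left(page.items, key)      # binary search within the page
        if i < len(page.items) and page.items[i] == key:
            return True
        if not page.children:                 # a leaf page: nowhere left to look
            return False
        page = page.children[i]               # follow the pointer between the items

root = Page([20], [Page([5, 10]), Page([25, 30, 35])])
print(btree_search(root, 30))  # True
print(btree_search(root, 7))   # False
```

Only one page per level is ever examined.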

Inserting values into a B-Tree


Inserting a value into an existing B-Tree so that it remains a valid B-Tree turns out to be
amazingly easy for so complex a data structure. If there is still room in the page then insertion
really is trivial. The only problem arises if the page is full, i.e. already has 2n items. In this case
the 2n+1 items (the full page plus the new item) are split into two pages at the same level, each
containing n items, and the remaining middle item is inserted into the page above - see Figure 8.
Figure 8: Inserting an item into a B-Tree

Of course there is always the possibility that the item that you have to insert in the page above
will need that page to be split and so on, propagating splits perhaps even as far up the tree as the
root page.
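
The split-and-promote procedure can be sketched as below. This is an illustration under assumptions rather than a production implementation: the Page class and function names are invented, keys are assumed to be distinct, and everything lives in memory instead of on disk.

```python
from bisect import bisect_left, insort

class Page:
    def __init__(self, items=None, children=None):
        self.items = items or []        # keys in ascending order
        self.children = children or []  # empty for a leaf page

def _insert(page, key, n):
    """Insert key below page. Returns (middle_item, new_page) if page split, else None."""
    if not page.children:
        insort(page.items, key)                    # leaf: just slot the key in
    else:
        i = bisect_left(page.items, key)
        split = _insert(page.children[i], key, n)
        if split is not None:                      # the child split: absorb its middle item
            middle, right = split
            page.items.insert(i, middle)
            page.children.insert(i + 1, right)
    if len(page.items) <= 2 * n:
        return None
    # Overflow (2n+1 items): keep n here, move n to a new page, pass the middle up.
    middle = page.items[n]
    right = Page(page.items[n + 1:], page.children[n + 1:])
    page.items = page.items[:n]
    page.children = page.children[:n + 1]
    return middle, right

def insert(root, key, n):
    """Insert key into a B-Tree of order n; returns the (possibly new) root page."""
    split = _insert(root, key, n)
    if split is None:
        return root
    middle, right = split
    return Page([middle], [root, right])           # the root split: the tree gets deeper

root = Page()
for key in (10, 20, 30, 40, 50, 60, 70):
    root = insert(root, key, n=1)

print(root.items)                               # [40]
print([page.items for page in root.children])   # [[20], [60]]
```

With n = 1 the seventh insertion splits a leaf page, which overfills the root and splits it in turn, leaving a tree three levels deep.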

Notice that the splitting operation propagating back to the root is the only way that a B-Tree
ever gets any deeper - which is a weird way to grow a tree. You can work out a similar operation
for deleting elements from a B-Tree.

The advantages of the B-Tree form of index are reasonably obvious:

You only need to access at most as many pages as there are levels in the tree.

As long as pages are the same size as a disk sector, reading or writing a page only
involves a single disk access.

At worst only 50% of the disk space allocated to the index is empty, because every page
except the root is at least half full.

Updating the index requires each page involved in the update to be accessed and
manipulated only once.

A less obvious advantage is the local nature of the update. In a multi-user system or LAN
only the pages involved in the modification have to be locked, making it possible to share
an index efficiently.

There are subtle ways of improving the performance of a B-Tree by redistributing items between
pages to achieve a better balance but this is icing on the tree.

Many database packages proclaim the fact that they are better because they use B-Trees - now
you know why. If you need to make use of B-Trees yourself then you can program everything
from scratch but there are plenty of B-Tree subroutine libraries that will save you hours of
coding and now you understand why you need one and how they work.

Credits
I have to admit that my account of B-Trees is based on the one given by Niklaus Wirth in his
classic book Algorithms + Data Structures = Programs. If you need a more complete but more
abstract approach, complete with code fragments, then try to get hold of a copy. Alternatively turn
to The Algorithm Design Manual.

Glossary
ancestor - a node above
binary tree - each node has at most two descendants
descendant - a node below
degree of node - the number of direct descendants
degree of tree - the maximum degree of any node in the tree
depth of tree - the maximum level of any node
interior node - any node that isn't a terminal node
internal path length - the sum of all the path lengths
leaf - a final node
level - the root is at level one, its direct descendants are at level 2 and so on
ordered tree - one in which the order of the descendants from each node is important
path length of node - the distance of the node from the root
root - the first node
sub-tree - a complete set of nodes connected to any given node
terminal node - same as leaf
