Data Structures and Algorithms: 1 Algorithm Analysis and Recursion
1.1 Notation
In general, the algorithms we are interested in will manipulate a number of data items, arranged in some sort of data structure. Let us call this number N. For each of our proposed algorithms, we can derive a function T(N) which will show how the complexity (space or time) of that algorithm is tied to the number of data items. For example, we see that, given N items arranged in a list, we require an average of N/2 steps down this list to find a given item. We could thus say that the time complexity is T(N) = N/2.
Of course, the actual computer program that performs this "list traversal" will require many more instructions to move through the list and perform the various tests that are required. However, this extra overhead will simply result in a larger value multiplying N, as well as an additional constant. Thus the expression remains a polynomial of order 1 — this is a linear relationship (of the form T(N) = aN + b).
When we analyse the complexity of an algorithm, we do not wish to consider issues such as the precise value of the constants referred to above. Usually we wish to see how the algorithm will perform for very large N, since most of our data structures will need to accommodate large amounts of data. So far we have only seen a linear expression for T(N). Unfortunately, for most algorithms the complexity is somewhat worse: we may have a quadratic function, for example. If we wish to compare two different algorithms for accessing a data structure, we need a way of comparing their complexity functions sensibly. The asymptotic notation defined below provides such a comparison.
Table 1: Functions ordered by growth rate.

    Function     Name
    c            constant
    log N        logarithmic
    log^2 N      log-squared
    N            linear
    N^2          quadratic
    N^3          cubic
    2^N          exponential
Big-Oh A function T(N) has complexity O(g(N)) (pronounced Big-Oh), written as T(N) = O(g(N)), if there are positive constants c and n0 such that T(N) ≤ c g(N) whenever N ≥ n0. Big-Oh thus provides an upper bound estimation.
Big-Omega A function T(N) has complexity Ω(g(N)) (pronounced Big-Omega), written as T(N) = Ω(g(N)), if there are positive constants c and n0 such that T(N) ≥ c g(N) whenever N ≥ n0. Big-Omega provides a lower bound estimate, i.e. our complexity is guaranteed to either match or exceed g(N), while Big-Theta, Θ(g(N)), provides a precise growth rate for the complexity: it is both an upper and lower bound. Finally, Little-Oh, o(g(N)), is simply Big-Oh, but with the requirement that the growth of T(N) is strictly less than that of g(N).
Table 1 shows a list of functions ordered by growth rate, and is a useful one to remem-
ber.
When you are comparing functions using any of the above definitions note the following:
- do not include constants or lower order terms, i.e. write O(N) rather than O(2N + 3);
- it is clear that for T(N) = 2N both T(N) = O(N^2) and T(N) = O(N) are valid. You should choose the tightest (most accurate) bound.
There are a few formal rules that can be used to aid in the analysis of algorithms, as well as some less formal, more intuitive rules that make code analysis easy. The formal rules are:
Rule 1: if T1(N) = O(f(N)) and T2(N) = O(g(N)), then
1. T1(N) + T2(N) = O(max(f(N), g(N)));
2. T1(N) · T2(N) = O(f(N) · g(N));
Rule 2: if T(N) is a polynomial of degree k, then T(N) = Θ(N^k);
Rule 3: logs are sub-linear: (log N)^k = O(N) for any constant k.
These formal rules help us to arrive at the short-cuts indicated below. We tend to use these short-cut rules when evaluating the Big-Oh complexity of a piece of code we have written.
FOR loops: The number of operations completed within a FOR loop is simply the number of iterations times the number of operations performed at each step. A loop whose counter runs from 1 to N and which does constant work per iteration is thus O(N).
Nested loops: We analyse these from the inside out; the total operation count is the product of the sizes of all the loops, so a doubly nested loop in which each loop runs N times is O(N^2).
Consider, for example, the following code fragment:
public static int whatsitdo(int N, int p) {
    int sum = 0, i, j;
    sum += N; sum = sum + 2;
    if (sum > p)
        for (i = 0; i < N*N; i++) sum += i*i;     // singly nested loop: N*N iterations
    else
        for (i = 0; i < N*N; i++)                 // doubly nested loop: N*N iterations...
            for (j = 0; j < N; j++)               // ...of an inner loop of N steps
                sum += i + j;
    return sum;                                   // return the accumulated value
}
We see that we start with a sequence of statements: two lines of code, then a singly nested loop followed by a doubly nested loop. The number of items is N, so we derive our time complexity expression using this variable. According to the rules stated above, a singly nested loop has complexity O(N); however, in this case the loop counter runs up to N*N, so the first branch performs O(N^2) operations. The doubly nested loop performs N*N iterations of an inner loop of N steps, giving O(N^3). Taking the worst case over the two branches, the fragment as a whole is O(N^3).
1.3 Recursion
Recursion is widely used in many of the more elegant algorithms required for efficient
manipulation of data items. The basic idea may seem strange at first, but the concept
is mathematically well defined, so even if it bothers you, it does rest on a sound basis!
Let us first consider recursive functions in a mathematical sense. If we have a function T(N), we say that it is recursive if it can be expressed in terms of itself; that is, T(N) is defined using previously computed values such as T(N-1). While this would seem to lead to a chicken-and-egg type scenario, this is not so, provided we are very careful when defining the function. The simplest example of recursion is the factorial function:

    T(N) = N · T(N-1) for N > 1,    T(1) = 1.
The first point to note is the base case, T(1) = 1: without it, the self-reference would turn this into an infinite definition. The other point to note is how the self-reference is achieved: we cannot use T(N) on the right hand side, since we have not calculated it yet. However, we can use terms that have been calculated — in this example T(N-1)!
To show how one would evaluate such an expression, let us trace the steps to evaluate T(3):
1. T(3) = 3.T(2); we now need T(2)
2. T(2) = 2.T(1); we need T(1)
3. T(1) = 1, by definition, thus
4. T(3) = 3.(2.(1)) = 6 and we’re done!
The idea, then, is to build up your solution based on previous data you have calculated.
Many functions can be defined recursively. In particular, algorithms designed for data
structures which are self-similar (we’ll see this later) make heavy use of recursion.
For such recursive algorithms we can define recursive relationships to estimate the
complexity, but this is a fairly advanced topic that we will avoid.
The reason we are so interested in recursion stems from the fact that almost all com-
puter languages in use today support recursive function definitions. We can thus design
simple and elegant recursive functions to solve a whole host of problems, secure in the
knowledge that we can code them up directly. A recursive function definition for fac-
torial might look something like:
public int fact(int N) {
    if (N <= 1) return 1;
    else return N * fact(N-1);
}
The system will arrange for the function fact to be called with different arguments
as it recurses down towards the base case. At that point, we have all the information
we need to start evaluating the partial expressions and as a result we can evaluate the
original request. Just as we did for the mathematical definition, we require a base case
to halt the recursion. If you do not have such a check, the system will go on calling the
function until it eventually runs out of memory.
While recursive functions are “cool” they are not always appropriate. When a computer
implements a recursive function call it needs to save extra information to help it resolve
the call correctly. The more recursive calls we need, the more resources it has to
expend during evaluation. For the case of the factorial function, we can re-write it using
iteration (i.e. a loop), which is computationally equivalent but less resource intensive:
public int fact(int N) {
    int f = 1;
    if (N <= 1) return f;
    for (int i = 2; i <= N; i++) f = i * f;
    return f;    // the loop leaves N! in f
}
In the case of the factorial function, recursion is still a valid, if expensive, alternative. However, for some functions a recursive definition imposes additional overhead because data that has already been calculated ends up being recalculated. This is wasteful and leads to a very slow recursive solution. This is illustrated by the following recursive function (the Fibonacci Series):

    F(N) = F(N-1) + F(N-2) for N > 2,    F(1) = F(2) = 1.

To calculate F(N) we need F(N-1) and F(N-2), but to calculate F(N-1) we need F(N-2) and F(N-3). We thus end up calculating F(N-2) (recursively!) a second time. If you compound this recalculation over many different function calls it results in an enormous overhead. In this case, one can again use an iterative approach to calculate the desired number F(N). In fact, the complexity is then O(N) versus the exponential complexity of the recursive case. The
basic lesson we learn from this example is that recursion is only worthwhile if we do
not duplicate work. We shall see a great deal more of recursion in subsequent sections.
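As an illustration of this point, here is a minimal sketch (our own, in the style of the iterative factorial above; the method name fib is our choice) of an iterative Fibonacci routine. It keeps only the two most recently computed values and therefore performs O(N) work, whereas the naive recursive version repeats sub-computations exponentially often.

public long fib(int N) {
    if (N <= 2) return 1;              // base cases: F(1) = F(2) = 1
    long prev = 1, curr = 1;           // F(i-2) and F(i-1)
    for (int i = 3; i <= N; i++) {
        long next = prev + curr;       // F(i) = F(i-1) + F(i-2)
        prev = curr;
        curr = next;
    }
    return curr;
}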
2 Lists
A list is a data structure in which data items are accessed by moving through the collection from start to end. It is thus a linear data structure, since the algorithms used to traverse and manipulate the list have complexity O(N) for N data items. The list is an example of an abstract data type (ADT): a data structure (or object in programming terms) with an associated set of algorithms to manipulate it. The concept of a list does not impose any particular implementation. However, from a programming point of view some implementations are better than others, and these have become the norm. We will examine two of these: singly linked and doubly linked lists.
A linked list is simply a collection of nodes (each of which will hold item data, at the very least) which are connected in a special way. The use of a data-bearing node is common to all the data structures or ADTs that we will see; it is the way they are connected and accessed that differentiates them. In the case of a singly linked list (linked list for short) we simply "link" each node to its successor. By successor we mean the node that logically succeeds a given node: usually we impose an ordering on the data in the list, so a "lower" value would precede a "higher" value. This is not required by the definition, however. As far as an implementation is concerned, we have a node object, within which we store a reference to its successor, as indicated in Figure 1. Note that a null reference (one that is not pointing to anything) is represented by λ in the figures.
The node will store the actual data item, and may hold other information too. The
final node in the list will contain a null reference since it has no successor. Finally, to
complete the representation, we need a means to identify the front of the list, which
requires a reference to the node from which we can reach all others. We can call this
the head of the list.
Figure 1: Singly linked list: Each node is linked to the next via a Java reference. We also store a reference to the first node in the list, called head.
2.1.1 Insertion
Insertion can take place in a number of different ways: internally (within the list) or at
either end. The basic idea is simple enough:
1. create a new node with the given data item
2. find the successor and predecessor of this new node in the list
3. insert the node at this point by resetting the links appropriately
There are unfortunately a number of complications. When we insert at either end of the list we are missing either a successor or a predecessor. The code that you write has to check for each of these 3 cases and must link the new node in the correct way. The different possibilities are indicated in Figure 2. Finding the position to insert the new node is O(N).
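As a sketch of how this might look (our own code, not part of the original notes; the names ListNode and insert are our choices), a singly linked node and an ordered insertion handling the three cases could be written as:

class ListNode {
    int data;                              // the item stored in this node
    ListNode next;                         // reference to the successor (null at the end)
    ListNode(int data) { this.data = data; }
}

// Insert x into an ordered list and return the (possibly new) head reference.
ListNode insert(int x, ListNode head) {
    ListNode node = new ListNode(x);
    if (head == null || x < head.data) {   // empty list, or insertion at the front
        node.next = head;
        return node;
    }
    ListNode pred = head;                  // otherwise find the predecessor of the new node
    while (pred.next != null && pred.next.data < x)
        pred = pred.next;
    node.next = pred.next;                 // covers both middle and end insertion
    pred.next = node;
    return head;
}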
2.1.2 Deletion
Deletion involves identifying a node in the list which is to be removed (an operation that is O(N)), and then unlinking this node from its successor. The predecessor of the node is then linked to the node's successor. As was the case with insertion, there are 3 possible cases to consider: deletion from either end or deletion from within the list.
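A corresponding sketch (again our own, reusing the hypothetical ListNode class above) for removing the first node containing x might be:

ListNode remove(int x, ListNode head) {
    if (head == null) return null;            // empty list: nothing to do
    if (head.data == x) return head.next;     // deletion at the front: new head
    ListNode pred = head;
    while (pred.next != null && pred.next.data != x)
        pred = pred.next;                     // find the predecessor of the target node
    if (pred.next != null)
        pred.next = pred.next.next;           // bypass the target (middle or end case)
    return head;
}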
In many cases it is necessary to move through a list of objects from either the front or
the back of the list. In this case a single-linked list becomes less attractive. If we wish
to move through the list from the back, we must first identify the end of the list (which
requires N steps from the head reference) and then somehow move towards the front
of the list. This involves keeping track of the predecessor to each node as we step back
towards the head. One can, of course, maintain a reference to the last element of the
list, which we can call the tail. But even in this case we need to maintain a reference to
the predecessor as we move backwards through the list.
Figure 2: Insertion: Insertion into a singly linked list - we consider 3 different cases: insertion at the front, back and middle.
Figure 3: Doubly linked list: We now have a successor and a predecessor link. We also store a reference to the first item, Head, and a reference to the last item, Tail.
2.2.1 Insertion
Insertion for a doubly-linked list is similar to the singly-linked case: we may insert at either end or in the middle of the list. Insertion at either end is trivial: we update the head and tail references and make the new node's successor or predecessor point to the original first/last node. Insertion in the middle of the list requires us to identify either the successor or predecessor of the new node within the list, and then to update the nodes involved so that their predecessor/successor links are consistent. These operations are shown in Figure 4.
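For concreteness, a small sketch (ours, not the notes' code; it ignores the head/tail bookkeeping needed when inserting at either end) of a doubly linked node and of splicing a new node in after a given node:

class DNode {
    int data;
    DNode prev, next;                      // predecessor and successor references
    DNode(int data) { this.data = data; }
}

// Insert a new node containing x immediately after the (non-null) node pred.
void insertAfter(int x, DNode pred) {
    DNode node = new DNode(x);
    node.prev = pred;
    node.next = pred.next;
    if (pred.next != null)
        pred.next.prev = node;             // fix the old successor's predecessor link
    pred.next = node;
}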
2.2.2 Deletion
Figure 4: Insertion and Deletion: Some of the cases we need to consider (insertion inside the list, deletion in the middle, and deletion of the first node are shown). For deletion we simply re-attach the links to bypass the node we wish to delete.
3 Stacks and Queues
A stack is an ADT which supports a LIFO (last in, first out) style of storage and access. Think of a pile of books: we may add a new book to the top of the pile, and also remove books one by one from the top of the pile; removing a book from within the stack might cause it to topple over!
Stacks support two basic operations: push and pop. A push operation places a new
item on the top of the stack. The pop operation removes the top item from the stack (it
may also return the value of that item, if desired).
Stacks are most commonly implemented using a linked list: the head reference points
to the item on the “top” of the stack, and this item may be removed with a pop (delete)
operation. After a pop operation the head will point to a new “top of stack”. We simply
use the existing list infrastructure to implement the stack.
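As a sketch of this idea (our own code, reusing the hypothetical ListNode class from the list section), push and pop reduce to insertion and deletion at the head:

class Stack {
    private ListNode top;                  // head of the underlying linked list

    public void push(int x) {              // place a new item on top of the stack
        ListNode node = new ListNode(x);
        node.next = top;
        top = node;
    }

    public int pop() {                     // remove and return the top item
        int value = top.data;              // assumes the stack is not empty
        top = top.next;
        return value;
    }
}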
A queue is an ADT which supports a FIFO (first in, first out) style of storage and access: items are inserted at the back of the queue, and are always removed from the front of the queue. This is the way that queues work in the real world: you leave before someone who arrived after you! Queues have many applications in areas such as network modelling and are usually implemented as a simple modification of a linked list. The head points to the front of the queue, while the tail points to the back. You can use either a singly or doubly linked list. The enqueue and dequeue operations insert data at the back of the queue, and remove it from the front of the queue, respectively.
These operations are indicated in Figure 5.
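In the same spirit, a minimal queue sketch (ours again, using the hypothetical ListNode class) keeps both a front and a back reference so that enqueue and dequeue take constant time:

class Queue {
    private ListNode front, back;

    public void enqueue(int x) {                         // insert at the back
        ListNode node = new ListNode(x);
        if (back == null) { front = back = node; }       // queue was empty
        else { back.next = node; back = node; }
    }

    public int dequeue() {                               // remove from the front
        int value = front.data;                          // assumes the queue is not empty
        front = front.next;
        if (front == null) back = null;                  // queue has become empty
        return value;
    }
}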
4 Trees
Within computing the tree ADT is of fundamental importance. Trees are used when we desire more efficiency in our data access or manipulation. We have seen that in the case of a linked list we can achieve O(N) for all the relevant operations. This is very good; however, if our tree is properly constructed, operations such as insertion and deletion can be implemented in O(log N) for N data items. This is a vast improvement and makes sorting and searching algorithms, for example, more useful. The next sections examine trees in great detail and highlight their benefits and deficiencies.
Figure 5: A Queue: A queue is an ADT which allows insertion at one end of the list only (the back) and removal from the other (the front). These operations are illustrated here; the queue can thus be implemented using a linked list.
Figure 6: A Tree: A tree is a collection of nodes in which each node has one parent (predecessor) and (possibly) many children (successors). We can move from the root node to any given node in the tree.
sub-tree A tree is a recursive structure — a tree consists of a root node with a number of other trees hanging off this node. We call these sub-trees. Each of these sub-trees can again be viewed as a root node to which still smaller sub-trees are attached.
depth The depth of a node is the number of links we have to traverse from the root to
reach that node.
height The height of a node is the length of the longest path from the node to any leaf
node within the sub-tree.
leaf node A leaf node is one which has no children: its child links are all null refer-
ences.
sibling If a node has a parent (the node that points to it), we call any other node also
pointed to by the parent a sibling.
The generalised tree that we have introduced here, in which a node may have an ar-
bitrary number of children, is rather complex to manipulate and implement. For most
applications a binary tree will suffice. A binary tree is a tree in which every node has
at most two children. That is, a node may have 0, 1 or 2 children. The two children are
usually referred to as left and right. We also talk about the left sub-tree, for example, by
which we mean the sub-tree rooted at the left child node. It is important to distinguish
between the left and the right children since the order in which they are accessed will
be important for the algorithms we present below.
The remainder of this section deals with different kinds of binary trees.
Trees are often used to obtain sub-linear time complexity for operations such as sorting and searching of data. Database applications in particular make heavy use of complex tree structures to speed up data queries. In order for data to be efficiently retrieved, the data must be arranged within the tree according to an ordering property; this is precisely what the binary search tree (BST) of Figure 7 provides.
Figure 7: Binary Search Tree: Every node respects an ordering property. The data contained within its left subtree is less than the data in the node, while the data in the right subtree is larger. This allows for efficient searching of the tree structure.
Insertion into a BST must maintain the BST ordering property. (We can have "larger than or equal to" if duplicate data is permitted.) The algorithm is surprisingly simple. Given a node with data value p that we wish to insert:
1. If the tree is empty, link the root to the given node and exit.
2. If p is less than the current node's data, proceed to its left child;
3. otherwise, move down to its right child.
4. If the child node does not exist, slot the new node into that position and exit;
5. otherwise return to Step 2.
Figure 8: Insertion (of the values 8, 1, 13, 12, 20): If the tree is empty, we make the root point to the new node and exit. Otherwise we move down the tree branching left and right as required until we find an empty slot and insert the new node there.
In other words, we branch left or right as we descend the tree looking for an empty slot
within which we can place our new node. The first empty slot we encounter will then
be filled; this involves linking the new node to the parent with the empty slot. This is
illustrated by Figure 8. We first insert 8, which simply involves linking the new node
to the root. Then, we insert the value 1: we see that 1 is less than 8, hence we move
down from the root node to its left child. This happens to be an empty slot, so we make
the left link of the root node point towards the new node (which is what we mean by
“slotting it in”). We then insert 13. Once more, we start at the root, compare the value
to be inserted, and drop left or right depending on this choice. Since 13 is greater than
8, we drop down to the right child — which happens to be an empty slot again. We
thus set the root node’s right link to point to the new node and we’re done. To insert 12
we need to move further down the tree. Starting at the root, we see that 12 is greater
than 8, so we drop to its right child. On the next step, we note that 12 is less than 13
so we drop to its left child. Once again we have found an empty slot, so we set the left
link of node 13 to point to our new node. Finally, we insert the value 20. Starting at
the root we drop to the right child (node 13) and then drop to the right again, landing
in an empty slot. We then set the right link of node 13 to point to the new node and we
are done.
It is worth emphasising that the link that we use (left or right) when linking a new node
to its parent is not arbitrary: it is the link that we dropped down to get to the empty slot.
In Java programming terms an empty slot is represented using a null reference. After
the insertion, the reference is updated to point to the newly inserted node.
Implementation
In general we use recursive functions to implement tree-based routines. This often
provides a more natural way of looking at trees, since they are recursive structures. In
the case of insertion, this involves finding a base case (to terminate the recursion) and
deciding how we should partition the problem so that we can recursively solve a set of
smaller sub-problems.
To accomplish insertion we can write the following Java-like code, in which we assume that we have a Node class with a left and a right field containing object references, and an int field to hold our data:
public Node insert(int x, Node parent) {
    if (parent == null) return new Node(x);                    // base case: empty slot found
    if (x < parent.data) parent.left = insert(x, parent.left);
    else parent.right = insert(x, parent.right);
    return parent;
}
There are two important points to note about this code:
1. We have a base case which terminates the recursion: if parent is a null reference we have found an empty slot, so we create a new Node containing the data and return a reference to it.
2. We have recursive function calls (we call the function from inside itself) which
are shrinking the problem at each step. This is required for recursion to be useful.
Here we decide, based on the comparison, whether to insert the data into the left
or right sub-tree. We then call insert again with the new child node as the root of
the tree we wish to insert in. At the next function call, this check and decision
are made again, and this procedure repeats until we come to the base case (null
reference). Then we can start returning the data we need to get our final answer back.
We see that when we insert the first item, a 3 in this case, with a root value of null, then we return without any recursion at all. The root (which is a Java object reference) will now point towards a Node with the value 3. If we then proceeded to insert the value 4, the following would happen:
1. root = insert(4, root)
2. root = insert(4, root) { root.right = insert(4, root.right) }
3. root = insert(4, root) { root.right = insert(4, root.right) { Base case: return NodeRef } }
4. root = insert(4, root) { root.right = NodeRef }
5. root = root
This is meant to illustrate the flow of logic in the recursive function calls: the braces
represent what happens within each function call. We can only return from the recur-
sion when we hit the base case. At that point we have enough information to return
something useful to the function that called us. For every function call we return the
value of the parent reference (which may have been changed). In this example the
only reference that is actually changed is root.right. The root of the tree is only
affected by the first Node we insert into the tree. If we inserted additional Nodes we
could follow the logic in a similar fashion, but it rapidly becomes cumbersome. The
basic point to note is that we can directly translate a recursive function into a Java
recursive function implementation. As long as we have an appropriate base case and
provided we set up the recursive calls correctly, the function will execute as expected.
Deletion from a BST is somewhat more complicated than insertion. We must ensure
that the deletion preserves the BST ordering property. Unfortunately simply removing
nodes from the tree will cause it to fragment into smaller trees, unless the node is a leaf
node (has no children). The basic strategy is as follows:
1. Identify the node to be removed
2. if it is a leaf node, remove it and update its parent link;
3. If it is a node with one child, re-attach the parent link to the target node's child;
4. If it is a node with two children, replace the node value with the smallest value
in the right sub-tree and then proceed down the tree to delete that node instead.
These cases are illustrated in Figure 9. The third case is the most interesting and bears
some discussion. The smallest value in the right sub-tree of a node is smaller than all
other items in that sub-tree, by definition. By copying this value into the node we wish
to delete, we preserve the BST ordering property for the tree. Furthermore, we know
that the node we now have to delete (the node we searched for down the sub-tree) can
have at most 1 child! If it had two children that would imply there was a node that
contained an even smaller value (down its left child). We thus end up back at one of
the first 2 cases.
Implementation
The following routine shows how one might implement deletion. It assumes the ex-
istence of a function called public int findMin(Node X) which returns the
smallest data item starting from the tree rooted at X, which is found by following the
left link until you hit a node with a null left child.
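A sketch of such a findMin (our own code, written to match the description above) could be:

public int findMin(Node X) {
    while (X.left != null)       // keep following the left link
        X = X.left;
    return X.data;               // the leftmost node holds the smallest value
}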
public Node remove(int x, Node root) {
    if (root == null)
        System.out.println("Data item is not in tree!");
    else if (x < root.data)
        root.left = remove(x, root.left);
    else if (x > root.data)
        root.right = remove(x, root.right);
    else if (root.left != null && root.right != null) {       // we're here! node has two children
        root.data = findMin(root.right);
        root.right = remove(root.data, root.right);
    }
    else                                                      // zero or one child
        return (root.left != null) ? root.left : root.right;
    return root;
}
Figure 9: Deletion: There are 3 cases: i) deleting a leaf, ii) deleting a node with one child, and iii) deleting a node with 2 children. We cannot delete a node with 2 children directly, so we replace the data of this node with the data of its "smallest successor", and then proceed down the tree to delete that node.
Once more this function is a direct translation of the algorithm we described: we branch
left or right as we move down the tree looking for our target. Once we find it, we check
whether it has 0, 1 or 2 children and apply the appropriate deletion procedure. The
return statement in the single child case contains a compact way of generating a value
based on a simple “if-else” test. If the data item is not contained in a Node we will
end up going down a null link; in that case we simply report this fact and we are
done.
A proper Java implementation will need additional elements: a class called Node which
will contain the key/data and a class called BST which will define the structure of the
tree (a collection of linked Nodes). Here we have assumed we are inserting integers;
a general Java implementation would insert values of type Object. In that case the
< operator can no longer be used and a Java comparison method would be required to
compare two data values. We also need to store the root object reference within the
BST class.
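Pulling these pieces together, a minimal sketch of the surrounding classes (the exact layout and names, such as BST and root, are our own choices; the recursive insert from earlier is repeated here for completeness) might be:

class Node {
    int data;              // the key/data stored in this node
    Node left, right;      // references to the left and right sub-trees
    Node(int data) { this.data = data; }
}

class BST {
    private Node root;     // reference to the root node of the tree

    public void insert(int x) { root = insert(x, root); }

    private Node insert(int x, Node parent) {
        if (parent == null) return new Node(x);
        if (x < parent.data) parent.left = insert(x, parent.left);
        else parent.right = insert(x, parent.right);
        return parent;
    }
}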
While insertion and deletion are required to build and maintain the tree, the manner
in which binary search trees store data means that we can traverse them in interesting
ways. We talk about “walking” the tree, which simply means that we wish to visit each
node in the tree starting from the root. Generally we perform some or other useful
operation as we walk the tree, such as printing out the data value of the current node.
We shall call this the action in the following paragraphs. The action can, of course, be
“do nothing”, although that would be rather silly!
There are 3 basic ways of traversing the tree. They are all recursive, meaning that when we refer to the root node, we can actually substitute any valid node within the tree.
Inorder Walk: Starting at the root, we walk the left sub-tree. We then perform the
action on the root node. We then walk the right-subtree. Here, “walk the
left/right sub-tree” means that we apply the same procedure to the tree rooted
at the left/right child of the node we are on. Hence, the Inorder walk is recursive.
Preorder Walk: Starting at the root, we apply the action immediately. We then walk
the left sub-tree. Finally, we walk the right sub-tree.
Postorder Walk: Starting at the root, we walk the left sub-tree, then we walk the right
sub-tree. Finally, we perform the action on the node.
Level Order Walk A level order walk proceeds down the tree, processing nodes at the
same depth, from left to right i.e. level-by-level processing. This is also called a
breadth-first traversal. One can use a queue ADT to implement such a traversal,
but we shall not concern ourselves further with this scheme.
This all sounds rather bizarre, so an example is in order. The fundamental point to
remember is that the “walking” procedure is recursive: we apply the rules listed above
Figure 10: Inorder walk: starting at the root, we visit the left sub-tree, perform our action on the current node, then visit the right sub-tree. For the tree shown (root M, with children G and P; G has children A and I, P has children N and Z), the traversals give Inorder: A G I M N P Z; Preorder: M G A I P N Z; Postorder: A I G N Z P M.
for each node we arrive at, not only the root node. The walking procedure is O(N) for N nodes (the number of function calls is 2N + 1: one per node plus one per null reference).
Consider Figure 10 and suppose we perform an inorder walk. The following steps
occur:
1. We start at the root (node M).
2. We “walk the left-subtree” i.e. we drop to the node G.
3. The node G has its own left sub-tree: we “walk” that sub-tree dropping to A
4. We see that A does not have a left-subtree (it is a leaf node), so the “walk left
sub-tree” does nothing. Now, however, we can perform the action, since we
have processed the (empty) left sub-tree in its entirety. In this case, our action
is simply to print out the value of the node, “A”. We must now walk A’s right
sub-tree. But again, this is empty. So we have processed node A completely and
can return from that node.
5. We have now walked G’s left sub-tree (visited every node), so we move onto the
action for that node: we print “G”.
6. We must now walk G’s right sub-tree - we move down to the node I.
7. As before, we have to walk the left sub-tree first: it is empty, so we are done.
Then we can perform the action: we print “I”. We then walk the right (empty)
sub-tree and we’re done with node I.
8. node G has now been fully processed.
9. We have now walked M’s left sub-tree (visiting all the nodes contained therein);
we perform the action: we print “M”.
10. We now walk M’s right sub-tree: we drop to node P.
11. We walk P’s left-subtree: we drop to N.
12. We walk N’s left sub-tree (which is empty); we can now perform the action for
node N: we print “N”; and walk the (empty) right sub-tree, which means we’re
done with node N.
13. We have now processed P’s left sub-tree; so we print “P”; and move down to
its right sub-tree, starting at node Z.
14. Since Z is a leaf node, it has an empty left sub-tree; we thus print “Z”; see it has
an empty right sub-tree too and so we are done with node Z.
15. We are also, at this point, done with the entire tree since we have processed each
node in turn.
We note when we look at the output that the inorder traversal has returned an ordered (alphabetical in this case) display of the data in the tree. This is very useful. For
example, if the data in each node was a record in a database, we could use an inorder
walk to display an alphabetical list of all the clients’ information.
The pre and postorder traversals work in a similar fashion, but the sub-trees are walked
in a different order, so we will generate different output. In the pre-order case, we
print out a node’s data as soon as we reach that node, before moving on to recursively
descend its left and then right sub-tree. The printout we get in this case is

    M, G, A, I, P, N, Z

which does not seem particularly useful. In the postorder case, we first process the left and then the right sub-tree (recursively) before printing out the node value:

    A, I, G, N, Z, P, M
Again, this does not seem to be of any use. Although it may not be obvious, there are
many uses for these tree traversals. Compilers in particular make heavy use of pre
or postorder walks to handle the correct evaluation of mathematical expressions in a
programming language. Directory tools for operating systems also make use of such
traversals to accumulate disk usage information and so on.
Implementation
The implementation for the traversal schemes is simple, if you view them recursively.
As usual, we require a base case to terminate the recursion, and we must ensure that
each recursive call works on a smaller portion of the input set (the tree nodes in this
case). For the Inorder walk we have the following:
public void inorder(Node root) {
    if (root == null) return;         // base case: an empty sub-tree
    inorder(root.left);               // walk the left sub-tree
    System.out.println(root.data);    // perform the action on this node
    inorder(root.right);              // walk the right sub-tree
}
As we can see from the code, we can only perform the action on a node once we have
processed its entire left sub-tree. Having processed both the left and right sub-trees of a
node, we back up to the parent node and carry on working there. We thus move up and
down the tree as required by the recursive function calls. We never perform the action on a node twice.
The code for the preorder and postorder walks is similar; only the position of the action changes:

public void preorder(Node root) {
    if (root == null) return;
    System.out.println(root.data);                // action first, then the sub-trees
    preorder(root.left);  preorder(root.right);
}

public void postorder(Node root) {
    if (root == null) return;
    postorder(root.left);  postorder(root.right);
    System.out.println(root.data);                // action performed last
}
While the BST is a good attempt at achieving sub-linear data manipulation, it has a major flaw: it can degenerate into a list! In other words, depending on the data we insert or the deletions we perform, our tree can end up skewed completely to one side, as indicated in Figure 11.
Figure 11: Degenerate Binary Tree: Inserting the ordered data 1, 2, 3, 4 into a BST ensures that it has depth N for N nodes.
To remain useful, then, in addition to maintaining the BST ordering property, the tree must have some scheme to ensure that its height remains O(log N). Ideally, we would want the tree to be perfectly balanced, i.e. each left and right subtree would have the same height. This is far too restrictive, since we insert items one at a time and a tree with even 2 items will, by definition, be unbalanced under this scheme (it will consist of the root node and either a left or right child node)!
A more relaxed scheme, which nonetheless guarantees O(log N) height, was proposed
by Adelson-Velskii and Landis and is called the AVL tree.
In the case of an AVL tree, the height of the left and right sub-trees of any node can differ by at most one. In this case the tree is considered balanced. One can prove that this results in a tree which, although deeper than a completely balanced BST, still has a height/depth of O(log N). This is all we require to make sure that the tree is useful for data manipulation. An example of an AVL tree is shown in Figure 12; note the heights of each node (see the earlier definition for tree height), and how it compares to the corresponding "optimal" binary tree. It is worth noting that there are many different configurations, depending on the order in which the data is inserted into the AVL tree.
Insertion
An AVL tree uses a number of special tree transformations called rotations to ensure
Figure 12: An AVL Tree: the best possible BST is indicated on the left, with a possible AVL tree for the data on the right. The height of each node is indicated by h. For an AVL tree we can have a difference of at most 1 between the heights of the left and right sub-trees of a node.
that the correct balance is maintained after an insertion or deletion. These rotations are
guaranteed to respect the ordering properties of the BST. A very basic example of a
rotation is given in Figure 13.
After inserting A and B the tree is still AVL: the left and right sub-trees of the root node
differ by 1. Now we insert C. In this case we break the balance property at the root
node (the node B is still balanced): the left and right sub-trees of the root node now
differ by two. In order to restore balance at the root we “rotate” B towards A, resulting
in the final arrangement of nodes. We note two things:
1. the BST ordering property is preserved, and
2. the height of the root is now the same as it was prior to the insertion that caused the imbalance.
This transformation is known as a single (left) rotation about B. We can also say that we
performed a single rotation with the right child of A or even that we rotated B towards
A.
What happens in more complicated scenarios? As it happens there are only 4 possible transformations we need to consider, regardless of the tree's complexity. Two of these
transformations are symmetric counterparts (mirror images) so in reality there are only
two basic transformations: single and double rotations. We determine which one to use
by examining the path we followed during the insertion step.
Let us look at the single rotation first, Figure 14. As always, we assume that the tree was balanced prior to the insertion. The figure shows one of the 4 possible scenarios in which an insertion has violated the balance at the root node N1. Here, we have inserted into an "outer" sub-tree, T3, and this insertion caused the height of that sub-tree to grow from h to h+1. As indicated in the figure, a single rotation of N2 towards N1 restores the balance at N1.
Figure 13: Basic Rotation: We can rotate the node B "towards" A to fix the imbalance. This gives us a new root, with a new left and right sub-tree. The tree will now respect the AVL balance property.
Figure 14: Left Rotation: If an insertion into an "outer" sub-tree caused an imbalance at node N1, we can perform a single (left) rotation from N2 to N1 to restore the AVL property.
Figure 15: Right Rotation: This is the mirror image of the left rotation operation.
Figure 16: Failure of Single Rotation: If the insertion took place in an internal sub-tree, a single rotation will not restore the balance property at the unbalanced node.
Figure 17: Double Rotation: Insertion into an "inner" sub-tree can be fixed by exposing more of the structure of the sub-tree and then performing 2 rotations — one from N3 towards N2, followed by another from N3 towards N1.
Notice that a single rotation can be used in the cases where the insertion has occurred down the outside of the tree. A double rotation is only required when we have inserted into an internal sub-tree. The double rotation rearranges the nodes and sub-trees in the manner indicated — Figure 17 — and also ensures that the balance of the sub-tree (and thus the whole tree) is returned to its state prior to insertion. A double rotation can be viewed as two single rotations: one from N3 towards N2, followed by one from N3 towards N1.
Essentially we use the following approach when inserting nodes into the tree:
1. Perform the insertion, noting the path we took to reach the insertion point;
2. Move up the tree from the insertion point and check for an imbalance;
3. If we reach the root and all is well, we’re done;
4. otherwise, we check to see whether the insertion was into an “inner” or “outer”
sub-tree of the left or right child of the unbalanced node.
5. For outer sub-tree insertion, we use a single rotation, otherwise we use a double
rotation.
Consider Figure 18. Note that the order in which we insert the nodes determines the
structure of the resulting AVL tree. Insertion of 1 and 7 do not cause any problems.
However, when we insert 12, the root node becomes unbalanced (indicated by a box).
We see that we have inserted into an “outer” sub-tree on the right child: we therefore
require a single rotation from 7 towards 1 in order to fix the imbalance. Note that sub-
trees are indicated with triangles — empty sub-trees are shown when necessary so you can see which rules we are using. After the rotation we have a balanced sub-tree rooted at node 7.
We then insert 10 and 11. Insertion of 11 causes an imbalance at node 12. We see that we inserted into the right sub-tree of a left child node. This "zig-zag" pattern is characteristic of a double rotation, so we resolve the right sub-tree further.
[Figure 18: the tree after each insertion of 1, 7, 12, 10, 11 and 15, showing the single rotation of 7 towards 1, the double rotation 11 -> 10 -> 12, and the single rotation of 11 towards 7.]
[Figure 19: insertion of 8, 10.5 and 9, and the double rotation involving 7, 8 and 10 that restores balance.]
In this case we have the nodes 10, 11 and 12 which will be involved in the double rotation. All
the sub-trees are empty, however. This does not matter! We rotate from 11 towards
10, and then from 11 towards 12, giving us the tree shown on the right. As before, this
transformation restores the local balance of the tree to its state prior to insertion, so we
do not need to fix things further up in the tree.
We now insert 15. This causes an imbalance at the root. We see that we inserted into the
right sub-tree of the right child of the root. Such an “outer” insertion is characteristic of
a single rotation. We rotate 11 towards 7 to fix the tree. We then insert 8 and 10.5 (the fact that it is a real number is not important). The tree remains balanced until we insert
9. This introduces an imbalance at the node 7. As before, we see that the insertion
took place into a left sub-tree of the right child of 7 — a zig-zag pattern. This tells us
that we need to use a double rotation to fix the imbalance. The nodes involved will be
7,8,10, with the sub-trees as indicated in Figure 19. The resulting tree is shown on the
right-hand side of the diagram.
Implementation
An AVL Node looks much the same as a BST Node, but each node now maintains a height field, which is the height of that node. By looking at the heights of the left and right child Nodes one can see whether a Node is unbalanced and restore the tree. Remember that we only require a single or double rotation to restore the tree completely.
The basic algorithm is as follows:
1. Perform recursive BST tree traversal to find the insertion point (first null Node)
2. Insert the New node
3. Recalculate the height of the tree into which we inserted
4. As we back out recursively, check the heights of each left and right child node
5. If they differ by more than 1, apply the appropriate rotation, and we are done.
Note that once we have fixed the unbalanced sub-tree we are done with our task, but
we still have to return from all the recursive function calls we used during the insertion.
However, we know that the balance at all other nodes will be OK, so we will never have
to do another rotation. It is not considered good programming practice to prematurely
terminate the return of a sequence of recursive calls: if efficiency is critical, you can use
an iterative (non-recursive) implementation, but this will require more complex code.
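To make this concrete, here is a minimal sketch (our own, not the notes' formal implementation; the names AvlNode and rotateWithLeftChild are our choices, and height(null) is taken to be -1) of an AVL node and of one single rotation with its height updates:

class AvlNode {
    int data, height;                // height of the sub-tree rooted at this node
    AvlNode left, right;
    AvlNode(int data) { this.data = data; }
}

int height(AvlNode t) { return (t == null) ? -1 : t.height; }

// Single rotation: the left child k1 is rotated towards its parent k2.
// This fixes an imbalance caused by insertion into the outer (left-left) sub-tree.
AvlNode rotateWithLeftChild(AvlNode k2) {
    AvlNode k1 = k2.left;
    k2.left = k1.right;              // k1's inner sub-tree becomes k2's left sub-tree
    k1.right = k2;                   // k2 drops down to become k1's right child
    k2.height = Math.max(height(k2.left), height(k2.right)) + 1;
    k1.height = Math.max(height(k1.left), height(k1.right)) + 1;
    return k1;                       // k1 is the new root of this sub-tree
}

The mirror-image rotation and the double rotation (two single rotations in succession) follow the same pattern.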
Deletion
Deletion is significantly more difficult, if you apply the formal algorithm. However,
all you really need to do is the following (at least when you’re working out examples by
hand!):
1. apply the Standard BST deletion algorithm to find and delete the node,
2. starting at the deletion point, move towards the root, checking the node balance
at each step,
3. if you find a node out of balance, apply a single or double rotation to fix it; to
determine which to apply,
(a) identify the sub-trees attached to the left and right children of the unbal-
anced node;
(b) note which sub-tree is deeper, an internal or external one (in the former
case, resolve more structure);
(c) identify the Nodes leading to this sub-tree and then apply the fix; continue upwards.
4. you may have to apply a rotation at each node as you continue towards the root.
The root itself may have to be rebalanced.
As this algorithm implies, in the worst case you would incur O(log N) rotations to
rebalance the tree after a deletion. This is not ideal, since a rotation requires a fair
amount of logic to implement.
Consider Figure 20. Remember that we apply the standard BST deletion strategy to
remove nodes from an AVL tree. We begin by deleting 10 (which is a leaf, so this is trivial). This causes an imbalance at node 20. To identify the kind of rotation we require to rebalance the tree, we see that the left sub-tree of the right child of 20 is deeper than the right sub-tree (which is empty). We thus require a double rotation, resulting in a new sub-tree with a root of 22. Unfortunately, as we continue moving up the tree we see that our manipulations have unbalanced a node higher up! In this case the root of
the entire tree is the culprit: we can at least be sure that no additional rotations will be
required once we restore the balance of the root. We identify the left and right children
of the root node, and look at the structure of their sub-trees. We see that the outer-most
sub-tree is the deepest, which immediately tells us that a single rotation will suffice. We
thus rotate from 80 towards 30, fixing the balance of the root node and thus terminating
the deletion procedure.
We then wish to delete 30. We see that 30 has 2 children: we thus invoke the BST
deletion strategy and replace 30 with the smallest value down its right sub-tree, before
proceeding down the sub-tree to delete that node. In this case the tree balance has not
been affected. We now remove 70. This is a leaf, so we can delete it easily enough, but
in doing so we unbalance the node 60. Once more, we identify the sub-trees of the left
and right children of this node and see that we can get away with a single rotation from
22 towards 60 (here the two sub-trees are of equal depth, so both rotations would work; we use the single rotation since it is cheaper). Note that this rotation does NOT reduce the height of the sub-tree, but it is sufficient to ensure that the AVL property is restored.
We then delete 25 (a leaf) which does not damage the tree. Deletion of 80 requires that
we replace it with the smallest value down its right sub-tree and proceed to delete that
node. We thus copy 90 into 80 and delete the node 90 in the right sub-tree. As usual,
we use the BST deletion process: we simply link the original 90’s child to its parent,
“bypassing” the node we wish to delete. Deletion of 22 follows the same logic.
[Figure 20: the tree after each deletion of 10, 30, 70, 25, 80, 22, 95, 110 and 100, showing the double rotation 22 -> 25 -> 20 and the single rotations of 80 towards 30, 22 towards 60, and 60 towards 90.]
Finally, we delete 95, 110 and 100 in succession. The final deletion causes the tree
root to become unbalanced. We see that an outer sub-tree is deeper and thus perform a
single rotation from 60 towards 90.
Implementation
The formal implementation of deletion involves a fairly large number of test cases
according to which you select either a single or double rotation. Unfortunately, unlike
insertion, fixing a local sub-tree after deletion may not return its height to what it was
prior to that deletion. This means that the tree may become unbalanced further up,
requiring still more rotations. The standard implementation is recursive (as one would
expect), and uses the usual branching tests as you move towards the deletion point. It
mirrors the BST deletion code, until you actually perform the deletion. In this case the
various cases for fixes via rotation have to be enumerated, and the fixes applied. We are
guaranteed that after we do this the sub-tree rooted at the node out of balance will be
balanced once more. We can then recompute its height for later use and recurse back
up the tree. One tricky aspect of the code is the way in which you recalculate the node
heights efficiently as you recurse back up the tree.
There are numerous Java code implementations performing both insertion and deletion
and rather than getting bogged down in details, we refer the reader to those. In practice, the more efficient Red-Black tree ADT is used to implement a balancing scheme.
Unfortunately this is a rather sophisticated data structure, which falls beyond the scope
of these notes.
There are a multitude of tree ADTs used for data storage and manipulation. We mention only one further example here: the B-tree.
B-Trees A B-tree is an M-ary tree which satisfies a set of constraints. An M-ary tree is one that can have at most M children per node. A B-tree differs from most trees in that it grows level by level. The tree properties are as follows:
1. data is stored in special leaf nodes;
2. non-leaf nodes contain at most M - 1 search keys, and M node references;
3. the root is either a leaf node, or has 2 to M children;
4. other non-leaf nodes have between ⌈M/2⌉ and M children;
5. leaf nodes are all at the same depth and contain ⌈L/2⌉ to L data entries.
The values M and L are chosen based on the application, but are usually related to disk-block size. B-trees are the primary data structure used in large database systems since they ensure very fast query times. For a B-tree we have O(log_M N) height for N data nodes, due to M-way branching.