Background Reading - R Tree With Examples
University of Passau
Proseminar:
Algorithms and Datastructures for
Database Systems
SS 2003
R-Tree
Sebastian Käbisch
18 June 2003
Moderation by Thomas Bernreiter
Contents
1 Introduction
3 Algorithms
  3.1 Searching
    3.1.1 Example
  3.2 Insertion
    3.2.1 Example
  3.3 Deletion
    3.3.1 Example
  3.4 SplitNode
    3.4.1 Quadratic-Cost
    3.4.2 A Linear-Cost Algorithm
4 Performance Tests
  4.1 Results of inserting records
  4.2 Results of searching
  4.3 Results of deleting records
  4.4 More performance tests
5 Problems
6 Developments
7 Summary
8 Bibliography
1 Introduction
One of the main demands in geo-data applications is to respond very quickly
to spatial queries. Spatial data objects often cover areas in multi-dimensional
spaces. Queries over several dimensions cannot be served efficiently by classical
indexing structures such as the B-Tree [Kemp01], because these are
one-dimensional indexing structures. However, modern information processing
applications such as CAD (Computer Aided Design), cartography and multimedia
use multi-dimensional data objects, that is, objects with several attributes.
Thus, the database system needs an efficient multi-dimensional index structure.
This paper is concerned with the R-Tree. Firstly, it outlines the structure of
an R-Tree. Then the algorithms for searching, inserting and deleting are
introduced. Finally, some results of R-Tree index performance tests, some
problems and some developments of the R-Tree are presented.
Although the data is spatial, the examples in this paper are only
two-dimensional, which keeps them easy to follow. However, it should not be
forgotten that an R-Tree can represent spatial data in several dimensions.
A, B and C are the entries of the root node. A, for instance, covers the child
nodes D, E, F and G and encloses them in a minimal bounding rectangle.
(I, tuple-identifier)
where tuple-identifier refers to a tuple in the database and I is an n-dimensional
rectangle which is the bounding box of the spatial object indexed.
(I, child-pointer)
where child-pointer is the address of a lower node in the R-Tree and I covers
all rectangles in the child node’s entries.
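The two entry forms above can be modelled as follows. This is only a minimal sketch: the choice of Python, the names Entry, Node and mbr, and the representation of a rectangle as a pair (lows, highs) are assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Assumed representation: a rectangle I is (lows, highs), one value per dimension.
Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]

@dataclass
class Node:
    leaf: bool
    entries: List["Entry"] = field(default_factory=list)

@dataclass
class Entry:
    I: Rect                  # bounding rectangle
    p: Union[int, Node]      # tuple-identifier (leaf) or child-pointer (non-leaf)

def mbr(rects: List[Rect]) -> Rect:
    """Smallest rectangle that encloses all given rectangles."""
    dims = range(len(rects[0][0]))
    lows = tuple(min(r[0][d] for r in rects) for d in dims)
    highs = tuple(max(r[1][d] for r in rects) for d in dims)
    return (lows, highs)

# Leaf node with two index records (hypothetical tuple-identifiers 1 and 2).
leaf = Node(leaf=True, entries=[
    Entry(((0.0, 0.0), (2.0, 1.0)), 1),
    Entry(((1.0, 2.0), (3.0, 4.0)), 2),
])

# Non-leaf entry (I, child-pointer): I covers all rectangles in the child node.
parent_entry = Entry(mbr([e.I for e in leaf.entries]), leaf)
```

Here parent_entry.I is the bounding rectangle of both records, which is exactly the condition stated for I in a (I, child-pointer) entry.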
Figure 5: Underflow
There are now fewer than m = 3 entries in the node. Thus the tree has to be
reorganised.
Figure 6: Overflow
3 Algorithms
Here the algorithms are presented as pseudo-code. In the pseudo-code the
rectangle part of an index entry E is denoted by E.I, and the tuple-identifier
or child-pointer part is denoted by E.p. The paper shows an example of
every algorithm. It always uses the same tree with the values M = 3 and
m = 1. Insertion and deletion use the SplitNode algorithm, which is
explained later in this section (3.4).
3.1 Searching
The search algorithm is similar to that of the B-Tree. It returns all qualifying
records whose rectangles overlap the search rectangle. The algorithm descends
the tree from the root. At each node it checks which entry rectangles overlap
the search rectangle and descends only into the subtrees of the overlapping
entries. Because several entries of a node may overlap the search rectangle,
more than one subtree may have to be visited. This procedure is repeated until
the leaf level is reached. Every leaf entry whose rectangle overlaps the search
rectangle is returned as a qualifying record.
Search pseudo-code:
T is the root node of an R-Tree. Find all index records whose rectangles overlap
a search rectangle S.
S1 [Search subtree] If T is not a leaf, then check each entry E to determine
whether E.I overlaps S. For all overlapping entries, start Search on the
subtree whose root node is pointed to by E.p.
S2 [Search leaf node] If T is a leaf, then check each entry E to determine
whether E.I overlaps S. If so, E is a qualifying record.
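Steps S1 and S2 can be sketched in Python as follows. The Entry and Node classes, the rectangle representation (lows, highs) and the sample coordinates are assumptions for illustration. Rectangles that merely touch are treated as overlapping here, which is also an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    I: tuple    # rectangle as (lows, highs)
    p: object   # tuple-identifier (leaf) or child node (non-leaf)

@dataclass
class Node:
    leaf: bool
    entries: list = field(default_factory=list)

def overlaps(a, b):
    """True if rectangles a and b intersect in every dimension."""
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d]
               for d in range(len(a[0])))

def search(T, S):
    """Return all leaf entries below T whose rectangle overlaps S."""
    result = []
    for E in T.entries:
        if overlaps(E.I, S):
            if T.leaf:
                result.append(E)               # S2: qualifying record
            else:
                result.extend(search(E.p, S))  # S1: descend into the subtree
    return result

# Hypothetical two-level tree with three records a, b and c.
leaf1 = Node(True, [Entry(((0, 0), (1, 1)), "a"), Entry(((2, 2), (3, 3)), "b")])
leaf2 = Node(True, [Entry(((5, 5), (6, 6)), "c")])
root = Node(False, [Entry(((0, 0), (3, 3)), leaf1),
                    Entry(((5, 5), (6, 6)), leaf2)])

hits = [E.p for E in search(root, ((0.5, 0.5), (2.5, 2.5)))]
```

Only the subtree of the first root entry is visited for this search rectangle, because the second root entry does not overlap it.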
3.1.1 Example
In Figure 7 the filled rectangle is the search rectangle.
Figure 7: Searching
The algorithm looks for qualifying records in the filled area. Figure 8
shows the path taken by the algorithm:
Figure 8: Searching
The filled rectangle overlaps the root entries R1 and R2, so the algorithm
checks both of them. In R1 only R4 overlaps the filled rectangle, so its
entries are checked as well. The algorithm has now arrived at the leaf level,
where the entries of the leaf node are checked for qualifying records. R11 is
the only one and thus the first search result. In R2 two rectangles overlap the
filled rectangle: R5 and R6. Both of them are checked, and the algorithm finds
that at the leaf level the entries R13, R15 and R16 overlap the search
rectangle. The final search result is therefore R11, R13, R15 and R16.
3.2 Insertion
Inserting index records for new data is similar to insertion into a B-Tree. New
data is added to the leaves, nodes that overflow are split, and splits are
propagated up the tree. The insertion algorithm is more complex than the search
algorithm because it needs some helper methods.
Insert pseudo-code:
AT3 [Adjust covering MBR in parent entry] P is the parent node of N and E_N
the entry of N in P. E_N.I is adjusted so that it tightly encloses all
rectangles in N.
AT4 [Propagate node split upward] If N has a partner NN that resulted from an
earlier split, then create a new entry E_NN with E_NN.p pointing to NN and
E_NN.I enclosing all rectangles in NN. If there is room in P, then add E_NN.
Otherwise start SplitNode to get P and PP, which together contain E_NN and
all old entries of P.
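One level of steps AT3 and AT4 might look as follows in Python. This is a sketch under assumptions: nodes are plain dictionaries, entries are two-element lists [I, p], and the function name adjust_parent is invented for this example. Instead of calling SplitNode itself, the sketch adds E_NN and only reports whether P now overflows, so the split is left to the caller.

```python
def mbr(rects):
    """Smallest rectangle (lows, highs) enclosing all given rectangles."""
    dims = range(len(rects[0][0]))
    return (tuple(min(r[0][d] for r in rects) for d in dims),
            tuple(max(r[1][d] for r in rects) for d in dims))

def adjust_parent(P, N, NN=None, M=3):
    """Apply AT3 (and AT4 if NN is given) to parent P of node N.

    Returns True if P holds more than M entries afterwards, i.e. if the
    caller would have to run SplitNode on P.
    """
    for E in P["entries"]:
        if E[1] is N:
            # AT3: tighten the covering rectangle of N's entry in P.
            E[0] = mbr([e[0] for e in N["entries"]])
    if NN is not None:
        # AT4: create an entry for the split partner NN and add it to P.
        P["entries"].append([mbr([e[0] for e in NN["entries"]]), NN])
    return len(P["entries"]) > M

# Hypothetical data: N gained a second record, so its entry in P is stale.
N = {"entries": [[((0, 0), (1, 1)), 1], [((2, 2), (4, 3)), 2]]}
P = {"entries": [[((0, 0), (1, 1)), N]]}
overflow = adjust_parent(P, N)
```

After the call, the entry of N in P covers both records of N, and overflow is False because P still has room.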
3.2.1 Example
A rectangle R21 is inserted (filled rectangle in Figure 9).
To find the best position for the new rectangle, the algorithm starts with
ChooseLeaf. The following figure shows the path taken by ChooseLeaf.
The first step is clear because R21 lies within R1. Next, ChooseLeaf chooses R3
because this rectangle needs less enlargement than R4. In the last step the
algorithm finds the leaf node, but the node is already full. This leads to a
node split, which the following figure shows.
SplitNode tries to keep the covering rectangles as small as possible. That is
why the algorithm puts R21 and R9 in rectangle R3, while R8 and R10 are put in
the new parent rectangle R3'. R3' is passed to AdjustTree, where it is
propagated upward. Since there is enough room to include R3', it is not
necessary to split this node again. R3 must be adjusted as well because it now
only points to R9 and to the new rectangle R21. Finally, the root entry R1 is
also adjusted because it includes the new entry R3'. In this way the structure
of the tree is preserved. The insertion is now finished,
and the following figure shows the newly included rectangle.
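The decision that ChooseLeaf makes in this example, choosing the rectangle that needs the least enlargement, can be sketched as follows. The function names and the coordinates given to R3, R4 and R21 are hypothetical, since the figures do not give numeric values. Breaking ties by the smaller area is an assumption as well.

```python
def area(r):
    """Area (or volume) of a rectangle given as (lows, highs)."""
    a = 1.0
    for lo, hi in zip(r[0], r[1]):
        a *= hi - lo
    return a

def cover(r, s):
    """Smallest rectangle that encloses both r and s."""
    return (tuple(min(a, b) for a, b in zip(r[0], s[0])),
            tuple(max(a, b) for a, b in zip(r[1], s[1])))

def choose_subtree(rects, new):
    """Index of the rectangle that needs the least enlargement to include new.

    Ties are broken by the smaller area (an assumption of this sketch).
    """
    return min(range(len(rects)),
               key=lambda i: (area(cover(rects[i], new)) - area(rects[i]),
                              area(rects[i])))

# Hypothetical coordinates for R3, R4 and the new rectangle R21.
R3 = ((0, 0), (4, 4))
R4 = ((6, 0), (9, 3))
R21 = ((3, 3), (5, 5))
chosen = choose_subtree([R3, R4], R21)
```

With these values R3 would grow by 9 area units and R4 by 21, so index 0 (R3) is chosen, which mirrors the choice made in the example.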
3.3 Deletion
Deletion in an R-Tree differs from deletion in a B-Tree, as the following
pseudo-code shows. Like insertion, deletion needs helper methods.
Delete pseudo-code:
D1 [Find node containing record] Start FindLeaf to find the leaf node L
containing E. If the search is unsuccessful, then terminate.
D4 [Shorten tree] If the root node has only one child after the tree has been
adjusted, then make the child the new root.
FindLeaf pseudo-code:
The root node is T. The leaf node containing the index entry E is to be found.
FL1 [Search subtree] If T is not a leaf, then check each entry F in T to determine
whether F.I overlaps E.I. For each such entry, start FindLeaf on the subtree
whose root is pointed to by F.p, until E is found or all entries have been
checked.
FL2 [Search leaf node for record] If T is a leaf, then check each entry to see
whether it matches E. If E is found, then return T.
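A possible Python rendering of FL1 and FL2 is shown below. The Entry and Node classes, the rectangle representation (lows, highs) and the sample tree are assumptions made for this sketch. An entry is considered to match E when both its rectangle and its tuple-identifier are equal.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    I: tuple    # rectangle as (lows, highs)
    p: object   # tuple-identifier (leaf) or child node (non-leaf)

@dataclass
class Node:
    leaf: bool
    entries: list = field(default_factory=list)

def overlaps(a, b):
    """True if rectangles a and b intersect in every dimension."""
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d]
               for d in range(len(a[0])))

def find_leaf(T, E):
    """Return the leaf node below T that contains entry E, or None."""
    if T.leaf:
        # FL2: look for a matching record in this leaf.
        if any(F.I == E.I and F.p == E.p for F in T.entries):
            return T
        return None
    for F in T.entries:
        # FL1: only subtrees whose rectangle overlaps E.I can contain E.
        if overlaps(F.I, E.I):
            found = find_leaf(F.p, E)
            if found is not None:
                return found
    return None

# Hypothetical tree in which the two root rectangles overlap each other.
leaf1 = Node(True, [Entry(((0, 0), (1, 1)), "a"), Entry(((2, 2), (3, 3)), "b")])
leaf2 = Node(True, [Entry(((2, 2), (6, 6)), "c")])
root = Node(False, [Entry(((0, 0), (3, 3)), leaf1),
                    Entry(((2, 2), (6, 6)), leaf2)])

target = find_leaf(root, Entry(((2, 2), (6, 6)), "c"))
```

In this example the first subtree is searched without success before record c is found in the second one, which illustrates why FindLeaf may have to follow several paths.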
CondenseTree pseudo-code:
Given is a leaf node L from which an entry has been deleted. If L has too
few entries, then eliminate it from the tree and reinsert its remaining
entries. This procedure is propagated upward to the root. Also adjust
all covering rectangles on the path to the root, making them smaller if possible.
CT2 [Find parent entry] If N is the root, then go to CT6. Otherwise let P be the
parent node of N, and E_N the entry of N in P.
CT3 [Eliminate underflow node] If N has fewer than m entries, then eliminate
E_N from P and add N to list Q.
CT4 [Adjust covering rectangle] If N has not been eliminated, then adjust E_N.I
to tightly contain all entries in N.
CT6 [Re-insert orphaned entries] Reinsert all entries of the nodes in list Q.
Entries from eliminated leaf nodes are inserted as in Insertion. Entries from
higher-level nodes, however, must be placed higher in the tree, so that the
leaves of their dependent subtrees are on the same level as the leaves of the
main tree.
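The upward pass of CondenseTree (steps CT2 to CT4) can be sketched as below. The sketch makes several assumptions: nodes are dictionaries with an "entries" list of [I, p] pairs and a "parent" reference, the function and helper names are invented, and the reinsertion of step CT6 is left to the caller, who receives the list Q of eliminated nodes.

```python
def mbr(rects):
    """Smallest rectangle (lows, highs) enclosing all given rectangles."""
    dims = range(len(rects[0][0]))
    return (tuple(min(r[0][d] for r in rects) for d in dims),
            tuple(max(r[1][d] for r in rects) for d in dims))

def condense_tree(L, m):
    """Walk from leaf L up to the root and return the eliminated nodes Q."""
    N, Q = L, []
    while N["parent"] is not None:                  # CT2: stop at the root
        P = N["parent"]
        i = next(k for k, E in enumerate(P["entries"]) if E[1] is N)
        if len(N["entries"]) < m:                   # CT3: underflow
            del P["entries"][i]
            Q.append(N)
        else:                                       # CT4: tighten E_N.I
            P["entries"][i][0] = mbr([E[0] for E in N["entries"]])
        N = P                                       # continue one level up
    return Q

def make_node(entries, parent=None):
    return {"entries": entries, "parent": parent}

# Hypothetical tree: leaf A has one record left after a deletion, and the
# covering rectangles stored in the root are larger than necessary.
root = make_node([])
A = make_node([[((0, 0), (1, 1)), "a"]], root)
B = make_node([[((5, 5), (6, 6)), "b"], [((6, 6), (7, 8)), "c"]], root)
root["entries"] = [[((0, 0), (3, 3)), A], [((5, 5), (9, 9)), B]]

Q = condense_tree(A, m=2)   # A underflows and is removed from the root
```

The entries of the nodes collected in Q would then be reinserted as described in CT6.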
The difference from the B-Tree lies in CondenseTree. Firstly, if a node
underflows, it is eliminated and its entries are inserted again. In a B-Tree,
by contrast, the node is merged with another node. Secondly, reinsertion is the
more practical choice for the R-Tree: deletion is easier to implement because
the Insertion routine can be reused, and deleting and reinserting entries
incrementally refines the spatial structure of the tree.
3.3.1 Example
In the following example nothing is changed except that m = 2 and M = 4. The
records of R11 and R12 are shown for clarification (Figure 13). Record c is
deleted. First, the delete algorithm starts FindLeaf to get the position of c.
With the new value m = 2, R11 has an underflow. It is eliminated from the
tree, but its last entry d is saved in list Q. Now R4 has an underflow.
The entry R11 of R4 is also put in list Q, and R4 is eliminated.
Firstly, the node R3 has to be placed at the same level again where it was
before it was put in Q. After that, leaf node R12 is reinserted in R5 because
R5 is the nearest rectangle, that is, the one that has to be enlarged least.
Finally, record b is inserted in R12 because it is the nearest rectangle that
has to be enlarged least. CondenseTree is finished. The root node has only one
child, and thus the child becomes the new root.
The following figure shows the new structure of the tree after deletion.
3.4 SplitNode
When a new entry is added to a full node containing M entries, it is necessary
to divide the collection of M + 1 entries between two nodes. Insertion and
Deletion have to use this method to preserve the tree structure. The division
should be done in a way that makes it as unlikely as possible that both new
nodes will need to be checked on subsequent searches. The total area of the two
covering rectangles after a split should therefore be minimized. The following
figure shows a ’good’ split and a ’bad’ split.
For SplitNode there are three algorithms, which differ in quality and
complexity. This paper introduces only two of them because the exhaustive
algorithm is normally not used. It is the best split algorithm in terms of
quality because it finds the split that minimizes the area of the covering
rectangles. Its cost, however, would be in the order of $2^{M-1}$, so the
algorithm would be too slow for large node sizes.
3.4.1 Quadratic-Cost
This algorithm tries to find a small-area split, but it is not guaranteed to
find the one with the smallest possible area. Quadratic-Cost first chooses the
two of the M + 1 entries that would waste the most area if they were put
together in one node, and puts them in two new nodes. From the remaining
entries it then selects the entry whose required area enlargement differs most
between the two nodes, and puts it in the node that needs less enlargement.
The procedure is repeated until all entries are assigned, or until one node has
so few entries that all remaining entries must be assigned to it to reach the
minimum m. The cost of Quadratic-Cost is in the order of $M^2$.
QS1 [Pick first entry for each group] Start PickSeeds to find two entries to be
the first elements of the groups. Assign each to a group.
QS2 [Check if done] If all entries have been assigned, then stop. If one group
has so few entries that all the rest must be assigned to it in order for it to
have the minimum number m, then assign them and stop.
QS3 [Select entry to assign] Start PickNext to choose the next entry to assign.
Put it in the group whose covering rectangle has to be enlarged least. If both
groups would be enlarged by the same amount, then add the entry to the group
with the smaller area, then to the one with fewer entries, then to either.
Repeat from QS2.
PickSeeds pseudo-code:
PS1 [Calculate inefficiency of grouping entries together] For each pair of
entries E1 and E2, create a rectangle J which includes E1.I and E2.I.
Calculate
d = area(J) − area(E1.I) − area(E2.I)
PS2 [Choose the most wasteful pair] Choose the pair with the largest d.
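PickSeeds translates almost directly into code. The following Python sketch assumes rectangles of the form (lows, highs) and works on a list of rectangles rather than on full entries. The function names and sample coordinates are chosen for this example only.

```python
from itertools import combinations

def area(r):
    """Area (or volume) of a rectangle given as (lows, highs)."""
    a = 1.0
    for lo, hi in zip(r[0], r[1]):
        a *= hi - lo
    return a

def cover(r, s):
    """Rectangle J: the smallest rectangle that includes both r and s."""
    return (tuple(min(a, b) for a, b in zip(r[0], s[0])),
            tuple(max(a, b) for a, b in zip(r[1], s[1])))

def waste(r, s):
    """PS1: d = area(J) - area(E1.I) - area(E2.I)."""
    return area(cover(r, s)) - area(r) - area(s)

def pick_seeds(rects):
    """PS2: indices of the pair of rectangles with the largest d."""
    return max(combinations(range(len(rects)), 2),
               key=lambda ij: waste(rects[ij[0]], rects[ij[1]]))

# Hypothetical rectangles: two neighbours and one far away.
rects = [((0, 0), (1, 1)), ((1, 0), (2, 1)), ((8, 8), (9, 9))]
seeds = pick_seeds(rects)
```

All pairs are examined, which is where the quadratic cost in the number of entries comes from.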
PickNext pseudo-code:
PN1 [Determine cost of putting each entry in each group] For each entry E which
is not in a group yet, d1 is calculated. d1 is the area increase required in
the covering rectangle of group 1 to include E.I. d2 is calculated in the same
way for group 2.
PN2 [Find entry with greatest preference for one group] Select any entry with
the maximum difference between d1 and d2.
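A sketch of PickNext in Python is given below. It assumes that each group is represented only by its current covering rectangle and that rectangles have the form (lows, highs). The function names and the example values are hypothetical.

```python
def area(r):
    """Area (or volume) of a rectangle given as (lows, highs)."""
    a = 1.0
    for lo, hi in zip(r[0], r[1]):
        a *= hi - lo
    return a

def cover(r, s):
    """Smallest rectangle that encloses both r and s."""
    return (tuple(min(a, b) for a, b in zip(r[0], s[0])),
            tuple(max(a, b) for a, b in zip(r[1], s[1])))

def pick_next(remaining, g1, g2):
    """Return (index, d1, d2) of the entry with the largest |d1 - d2|.

    remaining: rectangles not yet assigned to a group
    g1, g2:    current covering rectangles of group 1 and group 2
    """
    best = None
    for i, r in enumerate(remaining):
        d1 = area(cover(g1, r)) - area(g1)   # PN1: enlargement of group 1
        d2 = area(cover(g2, r)) - area(g2)   # PN1: enlargement of group 2
        if best is None or abs(d1 - d2) > abs(best[1] - best[2]):
            best = (i, d1, d2)               # PN2: strongest preference so far
    return best

# Hypothetical state: two groups and two unassigned rectangles.
g1 = ((0, 0), (1, 1))
g2 = ((8, 8), (9, 9))
remaining = [((4, 4), (5, 5)), ((1, 1), (2, 2))]
choice = pick_next(remaining, g1, g2)
```

The first rectangle lies equally far from both groups, so the second one, which clearly belongs to group 1, is selected first. Comparing d1 and d2 of the result tells the caller which group needs less enlargement.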
LinearPickSeeds pseudo-code:
LPS1 [Find extreme rectangles along all dimensions] In each dimension, find the
entry whose rectangle has the highest low side and the one with the lowest
high side. Save the separation.
LPS2 [Adjust for shape of the rectangle cluster] Normalize the separations by
dividing by the width of the entire set along the corresponding dimension.
LPS3 [Select the most extreme pair] Choose the pair with the greatest normalized
separation along any dimension.
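The three steps of LinearPickSeeds can be sketched as follows. Rectangles are assumed to have the form (lows, highs), and the function name and sample data are invented. The pseudo-code does not say what to do when the same entry is extreme on both sides or when the set has zero width in a dimension, so skipping such a dimension is an assumption of this sketch.

```python
def linear_pick_seeds(rects):
    """Return (separation, i, j) for the most extreme pair, or None.

    i is the entry with the lowest high side and j the entry with the
    highest low side in the winning dimension.
    """
    best = None
    n = range(len(rects))
    for d in range(len(rects[0][0])):
        # LPS1: extreme rectangles along dimension d.
        hi_low = max(n, key=lambda i: rects[i][0][d])   # highest low side
        lo_high = min(n, key=lambda i: rects[i][1][d])  # lowest high side
        width = max(r[1][d] for r in rects) - min(r[0][d] for r in rects)
        if hi_low == lo_high or width == 0:
            continue  # assumption: this dimension yields no usable pair
        # LPS2: normalize the separation by the width of the whole set.
        sep = (rects[hi_low][0][d] - rects[lo_high][1][d]) / width
        # LPS3: keep the greatest normalized separation.
        if best is None or sep > best[0]:
            best = (sep, lo_high, hi_low)
    return best

# Hypothetical rectangles spread out mainly along the first dimension.
rects = [((0, 0), (1, 1)), ((4, 0), (5, 1)), ((9, 0.5), (10, 1.5))]
result = linear_pick_seeds(rects)
```

Each dimension needs only one pass over the entries, which is why this seed selection is linear in the number of entries.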
4 Performance Tests
Antonin Guttman also presents various tests of the algorithms in his paper.
The present paper reports some of the results in order to give an idea of the
costs. Guttman tried to find out which algorithm performs best and what the
best values for M and m are. For this test five page sizes were used
(Figure 18). The minimum number of entries in a node was M/2, M/3 and 2. All
tests used two-dimensional data. The implementation was written in C under
Unix on a VAX 11/780 computer.
Each test run began by reading geometry data (layout data from the RISC-II
computer chip) with 1057 rectangles from files and inserting it into an empty
tree. After that, the program called the function Search with search
rectangles generated from random numbers. Finally, the test read
the input file again and called the function Delete to remove the index records.
The following abbreviations are used in the diagrams: E (Exhaustive algorithm),
Q (Quadratic algorithm) and L (Linear algorithm).
The exhaustive algorithm needs a lot of time even with small pages. The
linear algorithm is the fastest, as expected. With more bytes per page the CPU
cost does not increase much.
The diagrams show almost the same results for the different algorithms. The
reason is that searching does not use SplitNode. However, the results show that
the exhaustive algorithm, which produces a better index structure, has the best
values.
The result was strongly affected by the minimum node fill requirement. If the
value of m is large, nodes underflow more often. Their entries must then be
reinserted, and reinsertion sometimes causes nodes to split because of an
overflow.
With the quadratic algorithm, the cost of insertion is nearly constant from
2500 records onward, except where the tree increases in height. The linear
algorithm spends less on insertion, so there is no jump in its curve. Deletion
with the quadratic configuration produced only 1 to 6 node splits, which is
why the curve is very rough. In the linear configuration there are no node
splits, so the curve shows only a small jump. The test shows that the cost is
independent of tree width but is affected by tree height, which grows slowly with
the number of data items.
The last diagrams show that almost all of the space in an R-Tree index is
used for leaf nodes.
For the linear configuration the total space occupied by the R-Tree was about 40
bytes per data item, compared to 20 bytes per item for the index records alone.
The quadratic configuration needed 33 bytes per item.
5 Problems
Some problems arise in the implementation of an R-Tree [Saak01]. Searching
and inserting in particular show some negative effects:
When handling small and favourable records, these problems are not very
significant. However, if there are many unfavourable and multi-dimensional
records, efficiency is considerably impaired.
6 Developments
Over time the R-Tree has been improved, especially its structure. In addition,
specialized variants were developed for particular applications. Here are some
variants of the R-Tree.
R+-Tree [Sell87]
This tree tries to minimize the overlapping of regions. Here, objects are stored
in disjoint regions. The search algorithm is faster, but the tree structure is
more complicated.
R*-Tree [Beck90]
The SplitNode algorithm takes volume and the extent of overlapping into
consideration. All other properties are similar to the R-Tree.
X-Tree [Berc96]
It includes a Split-History and a function for node enlargement. This
prevents overlapping of regions.
7 Summary
Modern applications are hard to imagine without spatial index structures.
To handle spatial data efficiently, a database system needs a special
index mechanism. This paper introduced the first mechanism of this kind,
Guttman’s R-Tree. This tree is a dynamic index structure.
The algorithms are quite similar to those of the B-Tree, except for the delete
algorithm. Deletion eliminates a node that underflows and reinserts its entries
into the R-Tree. The SplitNode algorithm is an important part of Insertion and
Deletion. Three different implementations determine the quality of the tree
structure and the cost.
Even though the R-Tree has some efficiency problems when there are many
unfavourable and multi-dimensional records, it was a great achievement and
opened the door to indexing spatial data. Over time many new variants of the
R-Tree have been developed to improve efficiency and thus to
improve complex applications.
8 Bibliography
[Kemp01] A. Kemper, A. Eickler: Datenbanksysteme, Eine Einfuehrung, 4th edition,
pp. 207-211, 2001
[Berc96] S. Berchtold, D.A. Keim, H.-P. Kriegel: The X-Tree: An Index Structure
for High-Dimensional Data, Proceedings of the 22nd International Conference on
Very Large Databases, Bombay, India, 1996