R Tree
Abstract
In order to handle spatial data efficiently, as required in computer aided design and
geo-data applications, a database system needs an index mechanism that will help it
retrieve data items quickly according to their spatial locations. However, traditional
indexing methods are not well suited to data objects of non-zero size located in
multi-dimensional spaces. In this paper we describe a dynamic index structure called an R-tree
which meets this need, and give algorithms for searching and updating it. We present the
results of a series of tests which indicate that the structure performs well, and conclude
that it is useful for current database systems in spatial applications.
This research was sponsored by National Science Foundation grant ECS-8300463 and Air
Force Office of Scientific Research grant AFOSR-83-0254.
Permission to copy without fee all or part of this material is granted provided that the
copies are not made or distributed for direct commercial advantage, the ACM copyright
notice and the title of the publication and its date appear, and notice is given that
copying is by permission of the Association for Computing Machinery. To copy otherwise,
or to republish, requires a fee and/or specific permission.
© 1984 ACM 0-89791-128-8/84/006/0047 $00.75

1. Introduction
Spatial data objects often cover areas in multi-dimensional spaces and are not well
represented by point locations. For example, map objects like counties, census tracts
etc. occupy regions of non-zero size in two dimensions. A common operation on spatial
data is a search for all objects in an area, for example to find all counties that have
land within 20 miles of a particular point. This kind of spatial search occurs frequently
in computer aided design (CAD) and geo-data applications, and therefore it is important
to be able to retrieve objects efficiently according to their spatial location.

An index based on objects' spatial locations is desirable, but classical one-dimensional
database indexing structures are not appropriate to multi-dimensional spatial searching.
Structures based on exact matching of values, such as hash tables, are not useful because
a range search is required. Structures using one-dimensional ordering of key values, such
as B-trees and ISAM indexes, do not work because the search space is multi-dimensional.

A number of structures have been proposed for handling multi-dimensional point data, and
a survey of methods can be found in [5]. Cell methods [4, 8, 16] are not good for dynamic
structures because the cell boundaries must be decided in advance. Quad trees [7] and k-d
trees [3] do not take paging of secondary memory into account. K-D-B trees [13] are
designed for paged memory but are useful only for point data. The use of index intervals
has been suggested in [15], but this method cannot be used in multiple dimensions. Corner
stitching [12] is an example of a structure for two-dimensional spatial searching suitable
for data objects of non-zero size, but it assumes homogeneous primary memory and is not
efficient for random searches in very large collections of data. Grid files [10] handle
non-point data by mapping each object to a point in a higher-dimensional space. In this
paper we describe an alternative structure called an R-tree which represents data objects
by intervals in several dimensions.

Section 2 outlines the structure of an R-tree and Section 3 gives algorithms for
searching, inserting, deleting, and updating. Results of R-tree index performance tests
are presented in Section 4. Section 5 contains a summary of our conclusions.

2. R-Tree Index Structure
An R-tree is a height-balanced tree similar to a B-tree [2, 6] with index records in its
leaf nodes containing pointers to data objects. Nodes correspond to disk pages if the
index is disk-resident, and the structure is designed so that a spatial search requires
visiting only a small number of nodes. The index is completely dynamic; inserts and
deletes can be intermixed with searches and no periodic reorganization is required.

A spatial database consists of a collection of tuples representing spatial objects, and
each tuple has a unique identifier which can be used to retrieve it. Leaf nodes in an
R-tree contain index record entries of the form
    (I, tuple-identifier)
where tuple-identifier refers to a tuple in the database and I is an n-dimensional
rectangle which is the bounding box of the spatial object indexed:
    $I = (I_0, I_1, \ldots, I_{n-1})$
Here n is the number of dimensions and $I_i$ is a closed bounded interval [a, b]
describing the extent of the object along dimension i. Alternatively $I_i$ may have one
or both endpoints equal to infinity, indicating that the object extends outward
indefinitely. Non-leaf nodes contain entries of the form
    (I, child-pointer)
where child-pointer is the address of a lower node in the R-tree and I covers all
rectangles in the lower node's entries.

Let M be the maximum number of entries that will fit in one node and let $m \le M/2$ be
a parameter specifying the minimum number of entries in a node. An R-tree satisfies the
following properties:
(1) Every leaf node contains between m and M index records unless it is the root.
(2) For each index record (I, tuple-identifier) in a leaf node, I is the smallest
    rectangle that spatially contains the n-dimensional data object represented by the
    indicated tuple.
(3) Every non-leaf node has between m and M children unless it is the root.
(4) For each entry (I, child-pointer) in a non-leaf node, I is the smallest rectangle
    that spatially contains the rectangles in the child node.
(5) The root node has at least two children unless it is a leaf.
(6) All leaves appear on the same level.

Figure 2.1a and 2.1b show the structure of an R-tree and illustrate the containment and
overlapping relationships that can exist between its rectangles.

[Figure 2.1a: structure of an R-tree, with leaf entries pointing to data tuples.]

The height of an R-tree containing N index records is at most
$\lceil \log_m N \rceil - 1$, because the branching factor of each node is at least m.
The maximum number of nodes is $\lceil N/m \rceil + \lceil N/m^2 \rceil + \cdots + 1$.
Worst-case space utilization for all nodes except the root is $m/M$. Nodes will tend to
have more than m entries, and this will decrease tree height and improve space
utilization. If nodes have more than 3 or 4 entries the tree is very wide, and almost all
the space is used for leaf nodes containing index records. The parameter m can be varied
as part of performance tuning, and different values are tested experimentally in
Section 4.

3. Searching and Updating

3.1. Searching
The search algorithm descends the tree from the root in a manner similar to a B-tree.
However, more than one subtree under a node visited may need to be searched, hence it is
not possible to guarantee good worst-case performance. Nevertheless with most kinds of
data the update algorithms will maintain the tree in a form that allows the search
algorithm to eliminate irrelevant regions of the indexed space, and examine only data
near the search area.
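As a rough illustration of the entry layout from Section 2 and the pruned descent described above, the following Python sketch stores each rectangle as a tuple of (low, high) intervals and recurses only into entries whose rectangle overlaps the search rectangle. The names Rect, Entry, Node, overlaps and search are hypothetical, and reporting every overlapping leaf entry is an assumption about the leaf case, not a transcription of the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# A rectangle I = (I_0, ..., I_{n-1}); each I_i is a closed interval (low, high).
Rect = Tuple[Tuple[float, float], ...]


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the closed intervals of a and b intersect in every dimension."""
    return all(lo_a <= hi_b and lo_b <= hi_a
               for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b))


@dataclass
class Node:
    leaf: bool
    entries: List["Entry"] = field(default_factory=list)


@dataclass
class Entry:
    rect: Rect                  # the rectangle I of the entry
    ref: Union[Node, int]       # child node, or tuple identifier in a leaf


def search(node: Node, s: Rect) -> List[int]:
    """Collect tuple identifiers whose rectangles overlap the search rectangle s."""
    hits: List[int] = []
    for e in node.entries:
        if not overlaps(e.rect, s):
            continue            # prune: a covering rectangle contains everything below it
        if node.leaf:
            hits.append(e.ref)  # assumed leaf case: report the overlapping record
        else:
            hits.extend(search(e.ref, s))
    return hits


# Usage: two leaves under one root, in two dimensions.
leaf1 = Node(True, [Entry(((0, 2), (0, 2)), 1), Entry(((1, 3), (1, 4)), 2)])
leaf2 = Node(True, [Entry(((8, 9), (8, 9)), 3)])
root = Node(False, [Entry(((0, 3), (0, 4)), leaf1), Entry(((8, 9), (8, 9)), leaf2)])
print(search(root, ((2.5, 3), (3, 5))))
```

Because sibling rectangles may overlap, the loop can descend into more than one child of the same node, which is why no good worst-case bound is available.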
Algorithm Search. Given an R-tree whose root node is T, find all index records whose
rectangles overlap a search rectangle S.
S1. [Search subtrees.] If T is not a leaf, check each entry E to determine whether E.I
    overlaps S. For all overlapping entries, invoke Search on the tree whose root node
    is pointed to by E.p.

3.2. Insertion
Inserting index records for new data tuples is similar to insertion in a B-tree in that
new index records are added to the leaves, nodes that overflow are split, and splits
propagate up the tree.

Algorithm Insert. Insert a new index entry E into an R-tree.
I1. [Find position for new record.] Invoke ChooseLeaf to select a leaf node L in which
    to place E.
I2. [Add record to leaf node.] If L has room for another entry, install E. Otherwise
    invoke SplitNode to obtain L and LL containing E and all the old entries of L.
I3. [Propagate changes upward.] Invoke AdjustTree on L, also passing LL if a split was
    performed.
I4. [Grow tree taller.] If node split propagation caused the root to split, create a new
    root whose children are the two resulting nodes.

Algorithm ChooseLeaf. Select a leaf node in which to place a new index entry E.
CL1. [Initialize.] Set N to be the root node.
CL2. [Leaf check.] If N is a leaf, return N.
CL3. [Choose subtree.] If N is not a leaf, let F be the entry in N whose rectangle F.I
     needs least enlargement to include E.I. Resolve ties by choosing the entry with the
     rectangle of smallest area.
CL4. [Descend until a leaf is reached.] Set N to be the child node pointed to by F.p and
     repeat from CL2.

Algorithm AdjustTree. Ascend from a leaf node L to the root, adjusting covering
rectangles and propagating node splits as necessary.
AT1. [Initialize.] Set N=L. If L was split previously, set NN to be the resulting second
     node.
AT2. [Check if done.] If N is the root, stop.
AT3. [Adjust covering rectangle in parent entry.] Let P be the parent node of N, and let
     E_N be N's entry in P. Adjust E_N.I so that it tightly encloses all entry
     rectangles in N.
AT4. [Propagate node split upward.] If N has a partner NN resulting from an earlier
     split, create a new entry E_NN with E_NN.p pointing to NN and E_NN.I enclosing all
     rectangles in NN. Add E_NN to P if there is room. Otherwise, invoke SplitNode to
     produce P and PP containing E_NN and all P's old entries.
AT5. [Move up to next level.] Set N=P and set NN=PP if a split occurred. Repeat from
     AT2.

Algorithm SplitNode is described in Section 3.5.

3.3. Deletion
Algorithm Delete. Remove index record E from an R-tree.
D1. [Find node containing record.] Invoke FindLeaf to locate the leaf node L containing
    E. Stop if the record was not found.
D2. [Delete record.] Remove E from L.
D3. [Propagate changes.] Invoke CondenseTree, passing L.
D4. [Shorten tree.] If the root node has only one child after the tree has been
    adjusted, make the child the new root.

Algorithm FindLeaf. Given an R-tree whose root node is T, find the leaf node containing
the index entry E.
FL1. [Search subtrees.] If T is not a leaf, check each entry F in T to determine if F.I
     overlaps E.I. For each such entry invoke FindLeaf on the tree whose root is pointed
     to by F.p until E is found or all entries have been checked.
FL2. [Search leaf node for record.] If T is a leaf, check each entry to see if it
     matches E. If E is found return T.

Algorithm CondenseTree. Given a leaf node L from which an entry has been deleted,
eliminate the node if it has too few entries and relocate its entries. Propagate node
elimination upward as necessary. Adjust all covering rectangles on the path to the root,
making them smaller if possible.
CT1. [Initialize.] Set N=L. Set Q, the set of eliminated nodes, to be empty.
CT2. [Find parent entry.] If N is the root, go to CT6. Otherwise let P be the parent of
     N, and let E_N be N's entry in P.
CT3. [Eliminate under-full node.] If N has fewer than m entries, delete E_N from P and
     add N to set Q.
CT4. [Adjust covering rectangle.] If N has not been eliminated, adjust E_N.I to tightly
     contain all entries in N.
CT5. [Move up one level in tree.] Set N=P and repeat from CT2.
CT6. [Re-insert orphaned entries.] Re-insert all entries of nodes in set Q. Entries from
     eliminated leaf nodes are re-inserted in tree leaves as described in Algorithm
     Insert, but entries from higher-level nodes must be placed higher in the tree, so
     that leaves of their dependent subtrees will be on the same level as leaves of the
     main tree.

The procedure outlined above for disposing of under-full nodes differs from the
corresponding operation on a B-tree, in which two or more adjacent nodes are merged. A
B-tree-like approach is possible for R-trees, although there is no adjacency in the
B-tree sense: an under-full node can be merged with whichever sibling will have its area
increased least, or the orphaned entries can be distributed among sibling nodes. Either
method can cause nodes to be split. We chose re-insertion instead for two reasons: first,
it accomplishes the same thing and is easier to implement because the Insert routine can
be used. Efficiency should be comparable because pages needed during re-insertion usually
will be the same ones visited during the preceding search and will already be in memory.
The second reason is that re-insertion incrementally refines the spatial structure of the
tree, and prevents gradual deterioration that might occur if each entry were located
permanently under the same parent node.

3.4. Updates and Other Operations
If a data tuple is updated so that its covering rectangle is changed, its index record
must be deleted, updated, and then re-inserted, so that it will find its way to the right
place in the tree.

Other kinds of searches besides the one described above may be useful, for example to
find all data objects completely contained in a search area, or all objects that contain
a search area. These operations can be implemented by straightforward variations on the
algorithm given. A search for a specific entry whose identity is known beforehand is
required by the deletion algorithm and is implemented by Algorithm FindLeaf. Variants of
range deletion, in which index entries for all data objects in a particular area are
removed, are also well supported by R-trees.

3.5. Node Splitting
In order to add a new entry to a full node containing M entries, it is necessary to
divide the collection of M+1 entries between two nodes. The division should be done in a
way that makes it as unlikely as possible that both new nodes will need to be examined on
subsequent searches. Since the decision whether to visit a node depends on whether its
covering rectangle overlaps the search area, the total area of the two covering
rectangles after a split should be minimized. Figure 3.1 illustrates this point. The area
of the covering rectangles in the "bad split" case is much larger than in the "good
split" case.

[Figure 3.1: "Bad split" and "Good split" of the same set of rectangles.]

The same criterion was used in procedure ChooseLeaf to decide where to insert a new index
entry: at each level in the tree, the subtree chosen was the one whose covering rectangle
would have to be enlarged least.

We now turn to algorithms for partitioning the set of M+1 entries into two groups, one
for each new node.

3.5.1. Exhaustive Algorithm
The most straightforward way to find the minimum area node split is to generate all
possible groupings and choose the best. However, the number of possibilities is
approximately $2^{M-1}$ and a reasonable value
of M is 50*, so the number of possible splits is very large. We implemented a modified
form of the exhaustive algorithm to use as a standard for comparison with other
algorithms, but it was too slow to use with large node sizes.

*A two dimensional rectangle can be represented by four numbers of four bytes each. If a
pointer also takes four bytes, each entry requires 20 bytes. A page of 1024 bytes will
hold about 50 entries.

3.5.2. A Quadratic-Cost Algorithm
This algorithm attempts to find a small-area split, but is not guaranteed to find one
with the smallest area possible. The cost is quadratic in M and linear in the number of
dimensions. The algorithm picks two of the M+1 entries to be the first elements of the
two new groups by choosing the pair that would waste the most area if both were put in
the same group, i.e. the area of a rectangle covering both entries, minus the areas of
the entries themselves, would be greatest. The remaining entries are then assigned to
groups one at a time. At each step the area expansion required to add each remaining
entry to each group is calculated, and the entry assigned is the one showing the greatest
difference between the two groups.

Algorithm Quadratic Split. Divide a set of M+1 index entries into two groups.
QS1. [Pick first entry for each group.] Apply Algorithm PickSeeds to choose two entries
     to be the first elements of the groups. Assign each to a group.
QS2. [Check if done.] If all entries have been assigned, stop. If one group has so few
     entries that all the rest must be assigned to it in order for it to have the minimum
     number m, assign them and stop.
QS3. [Select entry to assign.] Invoke Algorithm PickNext to choose the next entry to
     assign. Add it to the group whose covering rectangle will have to be enlarged least
     to accommodate it. Resolve ties by adding the entry to the group with smaller area,
     then to the one with fewer entries, then to either. Repeat from QS2.

Algorithm PickSeeds. Select two entries to be the first elements of the groups.
PS1. [Calculate inefficiency of grouping entries together.] For each pair of entries E1
     and E2, compose a rectangle J including E1.I and E2.I. Calculate
     d = area(J) - area(E1.I) - area(E2.I).
PS2. [Choose the most wasteful pair.] Choose the pair with the largest d.

Algorithm PickNext. Select one remaining entry for classification in a group.
PN1. [Determine cost of putting each entry in each group.] For each entry E not yet in a
     group, calculate d1 = the area increase required in the covering rectangle of
     Group 1 to include E.I. Calculate d2 similarly for Group 2.
PN2. [Find entry with greatest preference for one group.] Choose any entry with the
     maximum difference between d1 and d2.

3.5.3. A Linear-Cost Algorithm
This algorithm is linear in M and in the number of dimensions. Linear Split is identical
to Quadratic Split but uses a different version of PickSeeds. PickNext simply chooses any
of the remaining entries.

Algorithm LinearPickSeeds. Select two entries to be the first elements of the groups.
LPS1. [Find extreme rectangles along all dimensions.] Along each dimension, find the
      entry whose rectangle has the highest low side, and the one with the lowest high
      side. Record the separation.
LPS2. [Adjust for shape of the rectangle cluster.] Normalize the separations by dividing
      by the width of the entire set along the corresponding dimension.
LPS3. [Select the most extreme pair.] Choose the pair with the greatest normalized
      separation along any dimension.
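The quadratic-cost split of Section 3.5.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's C implementation: rectangles are tuples of (low, high) intervals, entries are referred to by their index in the input list, and the helper names (area, cover, enlargement, pick_seeds, quadratic_split) are hypothetical.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[Tuple[float, float], ...]   # one (low, high) interval per dimension


def area(r: Rect) -> float:
    result = 1.0
    for lo, hi in r:
        result *= hi - lo
    return result


def cover(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle enclosing both a and b."""
    return tuple((min(la, lb), max(ha, hb)) for (la, ha), (lb, hb) in zip(a, b))


def enlargement(group: Rect, r: Rect) -> float:
    """Area increase needed for the covering rectangle `group` to include r."""
    return area(cover(group, r)) - area(group)


def pick_seeds(rects: Sequence[Rect]) -> Tuple[int, int]:
    """PS1-PS2: the pair that wastes the most area if placed in one group."""
    pairs = [(i, j) for i in range(len(rects)) for j in range(i + 1, len(rects))]
    return max(pairs, key=lambda p: area(cover(rects[p[0]], rects[p[1]]))
               - area(rects[p[0]]) - area(rects[p[1]]))


def quadratic_split(rects: Sequence[Rect], m: int) -> Tuple[List[int], List[int]]:
    """Split indices 0..len(rects)-1 into two groups of at least m entries each."""
    s1, s2 = pick_seeds(rects)                        # QS1
    groups = [[s1], [s2]]
    covers = [rects[s1], rects[s2]]
    rest = [i for i in range(len(rects)) if i not in (s1, s2)]
    while rest:                                       # QS2
        for g in (0, 1):
            if len(groups[g]) + len(rest) <= m:       # g needs all remaining entries
                groups[g].extend(rest)
                return groups[0], groups[1]
        # PN1-PN2: entry with the greatest preference for one group
        nxt = max(rest, key=lambda i: abs(enlargement(covers[0], rects[i])
                                          - enlargement(covers[1], rects[i])))
        rest.remove(nxt)
        # QS3: least enlargement, then smaller area, then fewer entries
        keys = [(enlargement(covers[g], rects[nxt]), area(covers[g]), len(groups[g]))
                for g in (0, 1)]
        g = 0 if keys[0] <= keys[1] else 1
        groups[g].append(nxt)
        covers[g] = cover(covers[g], rects[nxt])
    return groups[0], groups[1]


# Usage: M = 4, so M + 1 = 5 entries are split with m = 2.
boxes = [((0, 1), (0, 1)), ((1, 2), (0, 1)), ((10, 11), (10, 11)),
         ((11, 12), (10, 11)), ((0, 1), (1, 2))]
print(quadratic_split(boxes, 2))
```

pick_seeds examines every pair, which is where the quadratic cost in M comes from. Replacing it with a seed choice along the lines of LinearPickSeeds, and picking any remaining entry instead of the most biased one, would give the linear variant of Section 3.5.3.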
4. Performance Tests
We implemented R-trees in C under Unix on a Vax 11/780 computer, and used our
implementation in a series of performance tests whose purpose was to verify the
practicality of the structure, to choose values for M and m, and to evaluate different
node-splitting algorithms. This section presents the results.
Five page sizes were tested, corresponding to different values of M:

    Bytes per Page    Max Entries per Page (M)
         128                     6
         256                    12
         512                    25
        1024                    50
        2048                   102
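To get a feel for how these page sizes translate into tree shape, the height bound and node-count bound from Section 2 can be evaluated directly. The sketch below is illustrative only: N = 1000 records and m = M/2 (rounded down) are assumptions chosen for the example, not parameters reported for the tests, and the function names are hypothetical. Both functions assume N >= 2 and m >= 2.

```python
def height_bound(n: int, m: int) -> int:
    """ceil(log_m n) - 1, computed with exact integer arithmetic."""
    h, reach = 0, 1
    while reach < n:        # find the smallest h with m**h >= n
        reach *= m
        h += 1
    return h - 1


def max_nodes(n: int, m: int) -> int:
    """ceil(n/m) + ceil(n/m^2) + ... + 1, stopping at the level with one node."""
    total, level = 0, n
    while True:
        level = -(-level // m)   # ceiling division; nesting it gives ceil(n / m**k)
        total += level
        if level == 1:
            return total


# Usage: worst-case height for each tested M, with the assumed N and m.
N = 1000
for M in (6, 12, 25, 50, 102):
    m = M // 2
    print(M, m, height_bound(N, m), max_nodes(N, m))
```

Under these assumptions the bound on height drops from 6 at M = 6 to 1 at M = 102, which illustrates why larger pages give much shallower trees.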
[Figure 4.3: CPU cost of deleting records, plotted against bytes per page.]
[Figure 4.5: Search performance: CPU cost, plotted against bytes per page.]
[Figure: bytes required, for each split algorithm with m = 2 and m = M/2.]
In the figures, E = Exhaustive algorithm, Q = Quadratic algorithm, L = Linear algorithm.

insensitive to the use of different node split algorithms and fill requirements. The
exhaustive algorithm produces a slightly better index structure, resulting in fewer pages
touched and less CPU cost, but most combinations of algorithm and fill requirement come
within 10% of the best. All algorithms provide reasonable performance.