
VLDB Journal, 3, 517-542 (1994), Ralf Hartmut Güting, Editor

©VLDB

The TV-Tree: An Index Structure for High-Dimensional Data

King-Ip Lin, H.V. Jagadish, and Christos Faloutsos

Received July 12, 1993; accepted May 20, 1994.

Abstract. We propose a file structure to index high-dimensionality data, which are
typically points in some feature space. The idea is to use only a few of the features,
using additional features only when the additional discriminatory power is
absolutely necessary. We present in detail the design of our tree structure and the
associated algorithms that handle such "varying length" feature vectors. Finally,
we report simulation results, comparing the proposed structure with the R*-tree,
which is one of the most successful methods for low-dimensionality spaces. The
results illustrate the superiority of our method, which saves up to 80% in disk
accesses.

Key Words. Spatial index, similarity retrieval, query by content.

1. Introduction

Many applications require enhanced indexing that is capable of performing similarity
searching on several, non-traditional (exotic) data types. The target scenario is as
follows: given a collection of objects (e.g., 2-D images, 3-D medical brain scans,
or simply English words), we would like to find objects similar to a given sample
object. We rely on a domain expert to provide the appropriate similarity/distance
functions between two objects. A list of potential applications for such a system
follows:
• Image databases: Jagadish (1991) showed how to query for similar shapes,
describing each shape by the coordinates of a few rectangles that cover it
(≈20 features per shape). Niblack et al. (1993) supported queries on color,
shape and texture, using color histograms (64-256 attributes per image) as
feature vectors, and using the first 20 moments for shapes.

King-Ip Lin is a graduate student, and Christos Faloutsos, Ph.D., is Associate Professor, Department of
Computer Science, University of Maryland, College Park, MD 20742; H.V. Jagadish, Ph.D., is with AT&T
Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974.

• Medical databases, where 1-D objects (e.g., ECGs), 2-D images (e.g., x-rays),
and 3-D images (e.g., MRI brain scans; Arya et al., 1993) are stored. The
ability to retrieve quickly past cases with similar symptoms is valuable for
diagnosis, as well as for medical teaching and research purposes.
• Time series, such as financial databases with stock-price movements. The goal
is to aid forecasting, by examining similar patterns that have appeared in the
past. Agrawal et al. (1993) used the coefficients of the Discrete Fourier
Transform (DFT) as features.
• Multimedia databases, with audio (voice, music) or video (Narasimhalu and
Christodoulakis, 1991). Users might want to retrieve similar music scores or
video clips.
• DNA databases that contain a large collection of strings from a four-letter
alphabet (A,G,C,T); a new string has to be matched against the old strings,
to find the best candidates. The BLAST algorithm (Altschul et al., 1990)
uses successive, overlapping n-grams for indexing. When using n-grams as
features, we need $4^n$ features, or 1,024 features for n = 5.
• Searching for names or addresses, (e.g., in a customer mailing list), which are
partially specified or have errors. For example "1234 Springs Road" instead
of "1235 Spring Rd," or "Mr. John Smith" instead of "Dr. J. Smith, Jr."
Similar applications include spelling, typing (Kukich, 1992), and OCR error
correction. Given a wrong string, we should search a dictionary to find the
closest strings to it. Triplets of letters are often used to assess the similarity
of two words (Angell et al., 1983), in which case we have $26^3$ = 17,576
features per word (assuming that words consist exclusively of the 26 English
letters, ignoring digits, upper-case letters, etc.).

For all of these applications, we rely on an expert to derive features that
adequately describe the objects of interest. As Jagadish (1991) proposed, once
objects are mapped into points in some feature space, we can accelerate the search
by organizing these points in a spatial access method.
For a feature space with low dimensionality, any of the known spatial access
methods will work. However, in the above applications, the number of features per
object may range from 10 to 100. The spatial access methods of the past have mainly
concentrated on 2-D and 3-D spaces, such as the R-tree based methods (Guttman,
1984), and the linear-quadtree based methods (e.g., z-ordering; Orenstein and
Manola, 1988). Although conceptually they can be extended to higher dimensions,
they usually require time and/or space that grows exponentially with the number of
dimensions.
In this article, we propose a tree-structure that avoids the dimensionality problem.
The idea is to use a variable number of dimensions for indexing, adapting to the
number of objects to be indexed, and to the current level of the tree. Thus, for
nodes that are close to the root, we use only a few dimensions (and therefore,
we can store many branches, and enjoy a high fanout); as we descend the tree,
we become more discriminating, using more and more dimensions. Given that the
feature vectors contract and extend dynamically, resembling a telescope, we called
our method the Telescopic-Vector tree, or TV-tree.
This article is organized as follows: Section 2 surveys related work, highlighting
the problems of high-dimensionality. Section 3 presents the intuition and motivation
behind the proposed method. Section 4 presents the implementation of our method,
Section 5 gives the experimental results, and Section 6 lists the conclusions.

2. Related Work

As mentioned above, feature extraction functions map objects into points in feature
space for a variety of applications; these points must be stored in a spatial access
method. The prevailing methods form three classes: R*-trees (Beckmann et al.,
1990) and the rest of the R-tree family (Guttman, 1984; Jagadish, 1990); linear
quadtrees (Samet, 1989); and grid-files (Nievergelt et al., 1984).
Different kinds of queries arise; the most typical ones are listed below:
• Exact match queries. Find whether a given query object is in the database.
For example, check if a certain inventory item exists in the database.
• Range queries. Given a query object, find all objects in the database that
are within a certain distance from the object. Similarity queries also fall
within this category. For example, find all buildings within 2 miles of the
Washington National Airport; find all words within a one-letter substitution
from the word "tex"; find all shapes that look like a Boeing 747.
• Nearest neighbor queries. Given a query item, find the item that is closest or
most similar to the query item. For example, find the fingerprint that is most
similar to the one given. Similarly, k-nearest neighbor queries can be asked.
• All-pair queries. Given a set of objects, find all pairs within distance ε; or
find the k-closest pairs. For example, given a map, find all pairs of houses
that are within 100 feet of each other.
• Sub-pattern matching. Instead of looking at the objects as a whole, find a
sub-pattern within an object that matches our description. For example, find
stock movements that contain a certain pattern; or find all x-ray images that
contain tissue with tumor-like texture.

Previous work compared the performance of different spatial data structures.
Greene (1989) compared the R-tree, R+-tree, K-D-B-tree, and the 2-D Index
Sequential Access Method, and concluded that the R-tree and the R+-tree give the
better performance. Hoel and Samet (1992) compared the PMR-quadtree to the
R-tree variants for large line segment databases. Their results show that different
data structures are suited for different kinds of queries.
Most multidimensional indexing methods, however, explode exponentially with
the dimensionality, eventually reducing to sequential scanning. For linear quadtrees,
the effort is proportional to the hypersurface bounding the query region (Hunter and
Stieglitz, 1979); the hypersurface grows exponentially with the dimensionality. Grid
files face similar problems, because they require a directory that grows exponentially
with the dimensionality. The R-tree and its variants will suffer if a single feature
vector requires more storage space than a disk page can hold; in this case, the tree
will have a fanout of 1, reducing to a linked list.
Similar problems with high dimensionality have been reported for methods that
focus mainly on nearest-neighbor queries: Voronoi diagrams do not work at all for
dimensionalities higher than 3 (Aurenhammer, 1991). The method of Friedman et
al. (1975) does almost as much work as linear scanning for dimensionalities > 9.
The spiral search method of Bentley et al. (1980) also has a complexity that grows
exponentially with the dimensionality.
Relevant to our work is a wide variety of clustering algorithms (e.g., Hartigan,
1975; Salton and Wong, 1978; Murtagh, 1983, for surveys). However, the main goal
of these algorithms is to detect patterns in the data, and/or to assess the quality
of the clustering scheme using the precision and recall measures; there is usually
little attention to measures like the space overhead and the time required to create,
search, and update the structure.

3. Intuition Behind the Proposed Method

As mentioned, several of the target applications require indexing in a high-dimensional
feature space. Current spatial access methods suffer from the dimensionality curse
(i.e., exploding exponentially with the dimensionality).
The solution we propose is to contract and extend the feature vectors dynamically,
that is, to use as few of the features as necessary to discriminate among the objects.
This agrees with the intuitive way that humans classify objects: for example, in
zoology, the species are grouped in a few broad classes first, using a few features
(e.g., vertebrates versus invertebrates). As the classification is further refined, more
and more features are gradually used (e.g., warm-blooded versus cold-blooded, or
lungs versus gills).
The basis of our proposed TV-tree is to use dynamically contracting and extending
feature vectors. Like any other tree, it organizes the data in a hierarchical structure:
Objects (i.e., feature vectors) are clustered into leaf nodes of the tree, and the
description of their Minimum Bounding Region (MBR) is stored in the parent node.
Parent nodes are recursively grouped too, until the root is formed.
Compared to a tree that uses a fixed number of features, our tree provides
a higher fanout at the top levels, using only a few, basic features, as opposed to
many, possibly irrelevant, features.
As more objects are inserted into the tree, more features might be needed to
discriminate among the objects. At that time, new features are introduced. The
key point here is that features are introduced on a "when needed" basis and, thus,
we can soften the effect of the dimensionality curse.
The basic telescopic vector concept can be applied to a tree with nodes that
describe bounding regions of any shape (cubes, spheres, rectangles, etc.). Also, there
is flexibility in the choice of the telescoping function, which selects the features of
interest at any level of the tree. We discuss these design choices in the next two
subsections.

3.1 Telescoping Function

In general, the telescoping problem can be described as follows. Given an $n \times 1$
feature vector $\vec{x}$ and an $m \times n$ ($m < n$) contraction matrix $A_m$, the $m \times 1$ vector
$A_m \vec{x}$ is an m-contraction of $\vec{x}$. A sequence of such matrices $A_m$, with $m = 1, \ldots$,
describes a telescoping function provided that the following condition is satisfied:
If the $m_1$-contractions of two vectors, $\vec{x}$ and $\vec{y}$, are equal, then so are their respective
$m_2$-contractions, for every $m_2 \le m_1$.
While a variety of telescoping functions can be defined (Appendix B), the most
natural choice is simple truncation. That is, each matrix Am has a 1 in positions
(1,1) through (m, m), along a diagonal, and 0 everywhere else. In this article, we
assume that truncation is the telescoping function selected.
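
As a minimal sketch of truncation as a telescoping function, the snippet below keeps the first m features of a vector and checks the prefix condition stated above on two sample vectors. The function name `contract` and the sample values are illustrative assumptions, not part of the original method description.

```python
def contract(vector, m):
    """Return the m-contraction of `vector` under truncation (its first m features)."""
    if not 0 <= m <= len(vector):
        raise ValueError("m must be between 0 and len(vector)")
    return tuple(vector[:m])


# Hypothetical feature vectors that agree on their first two features.
x = (3.0, 5.0, 1.0, 8.0)
y = (3.0, 5.0, 7.0, 2.0)

# Equal 2-contractions imply equal m-contractions for every m <= 2.
assert contract(x, 2) == contract(y, 2)
assert all(contract(x, m) == contract(y, m) for m in range(1, 3))

# Longer contractions may differ; this is where extra features discriminate.
assert contract(x, 3) != contract(y, 3)
```

With truncation the condition holds trivially, because a shorter contraction is always a prefix of a longer one.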
The proposed method treats the features asymmetrically, favoring the first few
features over the rest, when truncation is used as the telescoping function. For
similarity queries, which are likely to be frequent in the application domains we
have in mind, it is intuitive that well ordered features will result in a more focused
search. Even for exact match queries, where the depth of the tree typically will not
be enough to have considered all features, a good choice of order will improve the
response time of our method. Notice, however, that the correctness is not affected;
poor ordering may make our method examine many false alarms, and thus do more
work, but it will never create false dismissals.
In most applications, transforming the given feature vector will achieve good
ordering. Ordering the features on the basis of importance is exactly what the
Karhunen-Loève (KL) transform achieves (Fukunaga, 1990): Given a set of n vectors
with d features each, it returns d new features, which are linear combinations of
the old ones, and which are sorted in discriminatory power. Figure 1 gives a 2-D
example, where the vectors k1 and k2 are the results of the KL transform on the
illustrated set of points.
The KL transform is optimal if the set of data is known in advance (i.e., the
transform is data-dependent). Sets of data with rare or no updates appear in real
applications: for example, databases that are published on CD-ROM, dictionaries,
or files with customer mailing lists that are updated in batches. The KL transform
will also work well if a large sample of data is available in advance, and if the new
data have the same statistical characteristics as the old ones.
Figure 1. Illustration of the Karhunen-Loève transform (scatter plot of points over the axes feature 1 and feature 2, with the KL directions k1 and k2).

In a completely dynamic case, we have to resort to data-independent transforms,
such as the Discrete Cosine Transform (DCT; Wallace, 1991), the Discrete Fourier
Transform (DFT), the Hadamard Transform (Hamming, 1977), and the Wavelet
Transform (Ruskai et al., 1992). Fortunately, many data-independent transforms will
perform as well as the KL if the data follow specific statistical models. For example,
the DCT is an excellent choice if the features are highly correlated. This is the
case in 2-D images, where nearby pixels have very similar colors. The JPEG image
compression standard (Wallace, 1991) exactly exploits this phenomenon, effectively
ignoring the high-frequency components of the DCT. Since the retained components
carry most of the information, the JPEG standard achieves good compression with
negligible loss of image quality.
We have observed similar behavior for the DFT in time series (Agrawal et
al., 1993). For example, random walks (also known as brown noise or Brownian
walks) exhibit a skewed spectrum, with the lower-frequency components being the
strongest (and, therefore, most important for indexing). Specifically, the amplitude
spectrum is approximately $O(f^{-1})$, where $f$ is the frequency. Stock movements
and exchange rates have been successfully modeled as random walks (Mandelbrot,
1977; Chatfield, 1984). Birkhoff's theory (Schroeder, 1991) claims that "interesting"
signals, such as musical scores and other works of art, consist of pink noise, whose
spectrum is similarly skewed ($O(f^{-0.5})$).
In general, if the statistical properties of the data are well understood, a
data-independent transform in many common situations will obtain near optimal results,
producing features sorted on the order of importance. We should stress again that
the use of a transform is orthogonal to the TV-tree: a suitable transform will just
accelerate the retrieval.

3.2 Shape of Bounding Region

As mentioned earlier, points are grouped together, and their minimum bounding
region (MBR) is stored in the parent node. The shape of the MBR can be chosen
to fit the application; it may be a (hyper-)rectangle, cube, sphere, etc. The simplest
shape to represent is the sphere, requiring only the center and a radius. A sphere
of radius r is the set of points with Euclidean distance ≤ r from the center of the
sphere. Note that the Euclidean distance is a special case of the Lp metrics, with
p = 2:

$$L_p(\vec{x}, \vec{y}) = \Big[ \sum_i |x_i - y_i|^p \Big]^{1/p} \qquad (1)$$

For the L1 metric (Manhattan, or city-block distance), the equivalent of a sphere
is a diamond shape; for the L∞ metric, the equivalent shape is a cube.

Definition. The Lp-sphere of center $\vec{c}$ and radius r is the set of points whose Lp
distance from the center is ≤ r.
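
The following sketch evaluates the Lp metric of Equation 1 and the Lp-sphere membership test for p = 1, 2, and ∞. The function names are illustrative assumptions, and the boundary is treated as part of the sphere.

```python
import math


def lp_distance(x, y, p):
    """L_p distance between two equal-length vectors; p may be math.inf."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def in_lp_sphere(point, center, radius, p):
    """True if `point` lies in the L_p-sphere (boundary included)."""
    return lp_distance(point, center, p) <= radius


center = (0.0, 0.0)
point = (0.6, 0.6)
print(in_lp_sphere(point, center, 1.0, 1))         # False: outside the diamond
print(in_lp_sphere(point, center, 1.0, 2))         # True: inside the circle
print(in_lp_sphere(point, center, 1.0, math.inf))  # True: inside the cube
```

For the same center and radius, the diamond is contained in the circle, which is contained in the cube, so the same point can fall outside one shape and inside another.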

The up-coming algorithms for the TV-tree will work with any Lp-sphere, without
any modifications to the TV-tree manipulation algorithms. The only algorithm that
depends on the chosen shape is the algorithm that computes the MBR of a set of
data. The algorithm for the diamond shape is presented in Appendix A.
Minor modifications are required in the TV-tree algorithms to accommodate
other popular shapes, such as rectangles or ellipses. Compared to Lp-spheres, these
shapes differ only in that they have a different radius for each dimension. The
required changes in the TV-tree algorithms are in the decision-making steps, such as
the criteria for choosing where to split, or which branch to traverse during insertion.
For the rest of this article, we concentrate on Lp-spheres as MBRs.

4. The TV-tree

4.1 Node Structure

Each node in the TV-tree represents the MBR (an Lp-sphere) of all of its descendents.
Each region is represented by a center, which is a vector determined by the telescoping
vectors representing the objects, and a scalar radius. We also call the center of the
region a telescopic vector (in the sense that it also contracts and extends depending
on the objects stored within the region). We use the term Telescopic Minimum
Bounding Region (TMBR) to denote an MBR with such a telescopic vector as a
center.
Definition. A telescopic Lp-sphere with center $\vec{c}$ and radius r, with dimensionality d
and with α active dimensions, contains the set of points $\vec{y}$ such that

$$c_i = y_i, \qquad i = 1, \ldots, d - \alpha \qquad (2)$$

and

$$r^p \ge \sum_{i=d-\alpha+1}^{d} |c_i - y_i|^p \qquad (3)$$
Figure 2. Example of TMBRs (diamonds, spheres) with different α.
(a) Number of active dimensions = 1. D1: Center (2), Radius 1; D2: Center (7, 6), Radius 2.
(b) Number of active dimensions = 2. D1: Center (2, 6), Radius 2; D2: Center (7, 4), Radius 1.
(c) Number of active dimensions = 2. S1: Center (2, 6), Radius 2; S2: Center (7, 4), Radius 1.
Legend: one marker denotes a region that extends indefinitely along the direction; a dot marks a center.

In Figure 2a, D2 has 1 inactive dimension (the first one), and 1 active dimension
(the second one). D1 also has one active dimension (the first one). The dimensionality
of D1 is 1 (only the first dimension has been taken into account in specifying
D1) and the dimensionality of D2 is 2 (both dimensions have been considered).
We need this concept because, as the tree grows, some leaf node will eventually
consist of points that all agree on their first, say, k dimensions. In this case, the
TMBR should exploit this fact; its first k dimensions are inactive dimensions, in the
sense that these dimensions cannot distinguish between the node's descendants.
In our presentation, the active dimensions are always the last ones. Moreover, we
can control the number of active dimensions α and ensure that all the TMBRs in
the tree have the same α. This number is a design parameter of the TV-tree.
Definition. The number of active dimensions (α) of a TV-tree is the (common) number
of active dimensions of all its TMBRs.
The notation TV-1 denotes a TV-tree with α = 1; Figure 2 shows the TMBRs
of TV-1 and TV-2 trees. The discriminatory power of the tree is determined by α.
Whenever more discriminatory power is needed, new dimensions are introduced to
ensure that the number of active dimensions remains the same.
The data structure for a TMBR is as follows:

struct TMBR { TVECTOR v;
              integer radius; }
struct TVECTOR { list_of (float feature_value);
                 integer no_of_dimensions; }

where TVECTOR stands for telescopic vector.
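
To make the telescopic Lp-sphere definition and the TMBR record concrete, here is a small sketch of a containment test. The class layout, the `contains` method, and the use of a floating-point radius are assumptions made for illustration; the sample regions use the values given for D1 and D2 in Figure 2a.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TMBR:
    center: Tuple[float, ...]  # telescopic vector; its length is the dimensionality d
    radius: float
    alpha: int                 # number of active dimensions (the last alpha of the d)
    p: float = 1.0             # L_p metric; p = 1 gives diamonds

    def contains(self, y):
        """Test a full-length point y against the region (assumes len(y) >= d)."""
        d = len(self.center)
        inactive = d - self.alpha
        # Inactive dimensions must match the center exactly.
        if any(self.center[i] != y[i] for i in range(inactive)):
            return False
        # Active dimensions must fall inside the L_p-sphere.
        total = sum(abs(self.center[i] - y[i]) ** self.p for i in range(inactive, d))
        # Dimensions beyond d are unconstrained.
        return total <= self.radius ** self.p


d1 = TMBR(center=(2.0,), radius=1.0, alpha=1)      # Figure 2a, D1
d2 = TMBR(center=(7.0, 6.0), radius=2.0, alpha=1)  # Figure 2a, D2

print(d1.contains((2.5, 100.0)))  # True: the second dimension is unconstrained
print(d2.contains((7.0, 7.5)))    # True: first dimension matches, second is within radius
print(d2.contains((6.9, 6.0)))    # False: the inactive first dimension differs
```

The region D1 is an infinite strip because only one dimension has been considered so far, while D2 is a line segment because its first dimension is pinned to the center.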

Figure 3. Example of a TV-1 tree (with diamonds)

4.2 Tree Structure

The TV-tree structure bears some similarity to the R-tree. Each node contains a set
of branches; each branch is represented by a TMBR denoting the space it covers;
all descendants of that branch will be contained within that TMBR; TMBRs are
allowed to overlap; and each node occupies exactly one disk page.
Examples of TV-1 and TV-2 trees are given in Figures 3 and 4. Points A
through I denote data points (only the first two dimensions are shown).
In the TV-1 tree, the number of active dimensions is 1, thus the diamonds
extend only along 1 dimension at any time. As a result, the shapes are straight lines
or rectangular blocks (extended infinitely). In the TV-2 case, the TMBRs resemble
two-dimensional Lp-circles.
At each stage, the number of active dimensions is exactly as specified. Sometimes,
more than one level of the tree may use the same active dimensions. Figure 4 is
an example; the same pair of active dimensions is used at both levels of the tree
shown. More commonly, new active dimensions are used at each level. This is the
case in Figure 3 when D3 has to be split further.
4.3 Algorithms

Search. For both exact and range queries, the algorithm starts with the root and
examines each branch that intersects the search region, recursively following these
branches. Multiple branches may be traversed because TMBRs are allowed to
overlap. The algorithm is straightforward and the pseudo-code is omitted for
brevity.
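
Since the pseudo-code is omitted, the following is only a sketch of how a range query could descend such a tree, assuming the L1 metric, a fixed number of active dimensions, and a simplified in-memory node layout. The names `Node`, `Branch`, `min_dist`, and `range_search` are hypothetical. A branch is followed whenever a lower bound on the distance from the query to its TMBR does not exceed the query radius.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ALPHA = 1  # assumed number of active dimensions


@dataclass
class Branch:
    center: Tuple[float, ...]  # telescopic center of the TMBR
    radius: float
    child: "Node"


@dataclass
class Node:
    is_leaf: bool
    points: List[Tuple[float, ...]] = field(default_factory=list)  # leaves only
    branches: List[Branch] = field(default_factory=list)           # internal nodes only


def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def min_dist(query, branch):
    """Lower bound on the L1 distance from `query` to any point inside the TMBR."""
    d = len(branch.center)
    inactive = d - ALPHA
    fixed = l1(query[:inactive], branch.center[:inactive])
    active = l1(query[inactive:d], branch.center[inactive:])
    return fixed + max(0.0, active - branch.radius)


def range_search(node, query, eps, out=None):
    """Collect all points within L1 distance eps of `query`."""
    out = [] if out is None else out
    if node.is_leaf:
        out.extend(pt for pt in node.points if l1(pt, query) <= eps)
    else:
        for b in node.branches:
            if min_dist(query, b) <= eps:  # TMBR intersects the search region
                range_search(b.child, query, eps, out)
    return out


leaf1 = Node(True, points=[(3, 0, 1), (3, 4, 2)])
leaf2 = Node(True, points=[(8, 1, 0), (9, 1, 5)])
root = Node(False, branches=[Branch((3, 2), 2, leaf1), Branch((8.5,), 0.5, leaf2)])
print(range_search(root, (3, 1, 1), 1.5))  # [(3, 0, 1)]
```

In this toy tree the second branch is pruned without visiting its leaf, because its lower bound already exceeds the query radius. An exact match query corresponds to `eps = 0`.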
Figure 4. Example of a TV-2 tree (with spheres)

Spatial join can be handled as well. Recall that such a query requires all pairs
of points that are close to each other (i.e., closer than a tolerance ε). Again,
a recursive algorithm that prunes out remote branches of the tree can be used;
efficient improvements on this algorithm have recently appeared (Brinkhoff et al.,
1993).
Similarly, nearest-neighbor queries can be handled with a branch-and-bound
algorithm (Fukunaga and Narendra, 1975). The algorithm works as follows: given
a query point, examine the top-level branches, and compute upper and
lower bounds for the distance; descend the most promising branch, disregarding
branches that are too far away.
Insertion. To insert a new object, we traverse the tree, choosing the branch at each
stage that seems most suitable to hold the new object. Once we reach the leaf
level, we insert the object in the leaf. Overflow is handled by splitting the node, or
by re-inserting some of its contents. After the insertion/split/re-insert, we update
the TMBRs of the affected nodes along the path. For example, we may have to
increase the radius of a TMBR or decrease its dimensionality (i.e., contract the
telescopic vector of the center), to accommodate the new object (Figure 5).
Figure 5. Decrease in dimensionality during insertion

The routine PickBranch(Node N, element e) examines the branches of the node
N and returns the branch that is most suitable to accommodate the element (point
or TMBR) e to be inserted. In choosing a branch, we use the following criteria,
in descending priority:
1. Minimum increase in overlapping regions within the node (i.e., choose the
TMBR such that, after the update, the number of new pairs of overlapping TMBRs
introduced within the node is minimized; Figure 6a).
2. Minimum decrease in dimensionality (i.e., choose the TMBR with which the
new object can agree on as many coordinates as possible, so that it can
accommodate the new object by contracting its center as little as possible).
For example, in Figure 6b, R1 is chosen to avoid contracting R2.
3. Minimum increase in radius (Figure 6c).
4. Minimum distance from the center of the TMBR to the point (in case the
previous two criteria tie; Figure 6d).
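
Applying the four criteria "in descending priority" amounts to a lexicographic comparison. The sketch below assumes that the four costs have already been computed for each candidate branch; the `Candidate` record and the sample numbers are hypothetical and serve only to show the tie-breaking order.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    new_overlaps: int        # criterion 1: new pairs of overlapping TMBRs
    dims_lost: int           # criterion 2: decrease in dimensionality
    radius_increase: float   # criterion 3
    center_distance: float   # criterion 4


def pick_branch(candidates):
    """Choose the candidate that is smallest under the lexicographic order of the criteria."""
    return min(
        candidates,
        key=lambda c: (c.new_overlaps, c.dims_lost, c.radius_increase, c.center_distance),
    )


candidates = [
    Candidate("R1", new_overlaps=0, dims_lost=0, radius_increase=0.5, center_distance=2.0),
    Candidate("R2", new_overlaps=0, dims_lost=1, radius_increase=0.0, center_distance=0.1),
    Candidate("R3", new_overlaps=1, dims_lost=0, radius_increase=0.0, center_distance=0.0),
]
print(pick_branch(candidates).name)  # R1
```

Here R3 loses on the first criterion and R2 on the second, so R1 wins even though it needs a larger radius and is farther from the new point.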

Handling overflowing nodes is another important aspect of the insertion algorithm.
Here an overflow can be caused not only by an insertion into a full node
but by an attempt to extend a telescopic vector as well. Splitting the node is the
most obvious way to handle overflow. However, reinsertion can also be applied,
selecting certain items to be reinserted from the top. This provides a chance to
discard dissimilar items from a node, usually achieving better clustering.
In our implementation we have chosen the following scheme to handle overflow,
treating the leaf node and the internal node differently:
• For a leaf node, a pre-determined percentage (Pri) of the leaf contents will
be reinserted if it is the first time a leaf node overflows during the current
insertion. Otherwise, the leaf node is split in two. Once again, different
policies can be used to choose the elements to be reinserted. Here we choose
those that are farthest away from the center of the region.
• For an internal node, the node is always split; the split may propagate
upwards.
Figure 6. Illustration of choose-branch criteria
(a) R1 is selected because extending R2 or R3 will lead to a new pair of overlapping regions.
(b) R1 is selected over R2 because selecting R2 will result in a decrease in dimensionality of R2.
(c) R1 is selected over R2 because the resulting region will have a smaller radius.
(d) R1 is selected over R2 because R1's center is closer to the point to be inserted.

Algorithm 1. Insert algorithm.

begin
/* Insert element e into tree rooted at N */
Proc Insert(Node N, element e)
1. Use PickBranch() to choose the best branch to follow; descend the tree until
   the leaf node L is reached.
2. Insert the element into the leaf node L.
3. If leaf L overflows
      If it is the first time during insertion
         Choose the Pri elements farthest away from the center of L and re-insert
         them from the top.
      else
         Split the leaf into two leaves.
4. Update the TMBRs that have been changed (because of insertion and/or
   splitting).
   Split an internal node if overflow occurs.
end
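
Step 3 of the insert algorithm can be illustrated with a short sketch that picks the elements to re-insert from an overflowing leaf. The function name, the integer rounding rule, the choice to re-insert at least one element, and measuring L1 distance over the dimensions present in the center are all assumptions made here for illustration.

```python
def l1(x, y):
    # zip() stops at the shorter vector, so a short telescopic center is compared
    # only on the dimensions it actually has (an assumption of this sketch).
    return sum(abs(a - b) for a, b in zip(x, y))


def choose_reinsert(points, center, percent=30):
    """Split leaf contents into (kept, reinserted); the farthest `percent`% are re-inserted."""
    count = max(1, len(points) * percent // 100)
    ranked = sorted(points, key=lambda pt: l1(pt, center), reverse=True)
    return ranked[count:], ranked[:count]


leaf_points = [(5, 5), (5, 6), (4, 5), (9, 9), (0, 0), (5, 4), (6, 5), (1, 9), (5, 7), (6, 6)]
kept, reinserted = choose_reinsert(leaf_points, center=(5, 5))
print(reinserted)  # [(0, 0), (9, 9), (1, 9)]
```

Removing the outliers first tends to shrink the leaf's region, which is why re-insertion is tried before a split on the first overflow.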

Splitting. The goal of splitting is to redistribute the set of TMBRs (or vectors, when
leaves are split) into two groups to facilitate future operations and provide high
space utilization. There are several ways to do the split. One way is to use a
clustering technique that groups vectors so that similar ones will reside in the same
TMBR.
Algorithm 2. Splitting by clustering
begin
/* assume N is an internal node; similar for leaf nodes */
Proc Split(Node N, Branch NewBranch, float min_percent)
1. Pick as seeds the branches B1 and B2 with the two most dissimilar TMBRs
   (i.e., the two with the smallest common prefix in their centers; on tie, pick
   the pair with the largest distance between their centers). Let R1 and R2 be
   the groups headed by B1 and B2, respectively.
2. For each of the remaining branches B:
      Add B to that group R1 or R2 according to the PickBranch() function
end
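
The seed-selection rule in step 1 can be sketched as follows, again assuming the L1 metric. The helper names are hypothetical, and distances between centers of different lengths are measured over their shared dimensions only, which is an assumption of this sketch.

```python
from itertools import combinations


def common_prefix_len(a, b):
    """Number of leading coordinates on which the two centers agree."""
    n = 0
    for u, v in zip(a, b):
        if u != v:
            break
        n += 1
    return n


def l1(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def pick_seeds(centers):
    """Indices of the two most dissimilar centers: smallest common prefix, then largest distance."""
    return min(
        combinations(range(len(centers)), 2),
        key=lambda ij: (
            common_prefix_len(centers[ij[0]], centers[ij[1]]),
            -l1(centers[ij[0]], centers[ij[1]]),
        ),
    )


centers = [(1, 2, 3), (1, 2, 9), (1, 5, 0), (4, 0, 0)]
print(pick_seeds(centers))  # (1, 3)
```

Three pairs share no common prefix with the last center; among them, the pair (1, 3) is the farthest apart, so it provides the seeds.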
Another way of doing the split is by ordering. The vectors (i.e., the centers of
the TMBRs) are ordered in some way and the best partition along the ordering is
found. The current criteria being used are (in descending priority):
1. Minimum sum of radii of the two TMBRs formed
2. Minimum of (sum of radii of TMBRs - distance between their centers)

In other words, we first try to minimize the area that the TMBRs cover; and
then minimize the overlap between the diamonds.
Ordering can be done in a few different ways. We have implemented one that
sorts the vectors lexicographically. Other orderings, such as a form of space-filling
curves (e.g., the Hilbert curve; Kamel and Faloutsos, 1993) can also be used.

Algorithm 3. Splitting by ordering

begin
/* assume N is an internal node; similar for leaf nodes */
/* min_fill is the minimum percentage (in bytes) of the node to be occupied */
Proc Split(Node N, Branch NewBranch, float min_fill)
1. Sort the TMBRs of the branches by ascending row-major order of their
   centers.
2. Find the best break-point in the ordering, to create two sub-sets: (a) ignore
   the case where one of the subsets is too small (< min_fill bytes); (b) among
   the remaining cases, choose the break-point such that the sum of the radii
   of the TMBRs of the two sets is the smallest. Break ties by minimum (sum
   of radii of TMBRs - distance between the centers).
3. If requirement (a) above leaves no candidates, then sort the branches by
   their byte size and repeat the above step, skipping step (a), of course.
end

The last step in the algorithm guards against the rare case where one of the
TMBRs has a long vector for center, while the rest have short vectors. In this case,
a seemingly good split might leave one of the two new nodes highly under-utilized.
The last step makes sure that the new nodes have similar sizes (byte-wise).
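
A simplified sketch of splitting by ordering is given below for a leaf of equal-sized points. It departs from the algorithm above in several labeled ways: the minimum fill is expressed as a count of entries rather than bytes, the byte-size fallback of step 3 is not modeled, and `bounding_diamond` is a stand-in that returns an enclosing (not necessarily minimal) L1 ball rather than the Appendix A routine.

```python
def l1(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def bounding_diamond(points):
    """Stand-in MBR: an enclosing L1 ball centered at the coordinate-wise midpoint."""
    dims = range(len(points[0]))
    center = tuple((min(p[i] for p in points) + max(p[i] for p in points)) / 2 for i in dims)
    radius = max(l1(p, center) for p in points)
    return center, radius


def split_by_ordering(points, min_count):
    """Return the best (left, right) partition along the lexicographic order, or None."""
    ordered = sorted(points)  # lexicographic (row-major) order
    best_key, best = None, None
    for cut in range(min_count, len(ordered) - min_count + 1):
        left, right = ordered[:cut], ordered[cut:]
        (c1, r1), (c2, r2) = bounding_diamond(left), bounding_diamond(right)
        # Primary criterion: sum of radii; tie-break: sum of radii minus center distance.
        key = (r1 + r2, r1 + r2 - l1(c1, c2))
        if best_key is None or key < best_key:
            best_key, best = key, (left, right)
    return best


pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
left, right = split_by_ordering(pts, min_count=2)
print(left)   # [(0, 0), (0, 1), (1, 0)]
print(right)  # [(10, 10), (10, 11), (11, 10)]
```

On this toy input the break-point between the two clusters gives the smallest sum of radii, so each cluster ends up in its own node.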

Deletion. Deletion is straightforward, unless it causes an underflow. In this case,
the remaining branches of the node are deleted and re-inserted. The underflow
may propagate upwards.

Extending and Contracting. As previously mentioned, extending and contracting of
TVECTORs are important aspects of the algorithm. Extending is done at the
time of split and reinsertion. When the objects inside a node are redistributed
(either by splitting into two or removing at reinsertion), it may be the case that the
remaining objects have the same values in the first few (or all) active dimensions.
Thus, during the recalculation of the new TMBR, extension will occur (i.e., new
active dimensions will be introduced and those on which all the objects agree will
be rendered inactive).
An example of extending diamonds is given in Figure 7. After extension, the
diamond extends only along the y-dimension.
On the other hand, contraction occurs during insertion. When an object is
inserted into a TMBR such that the inactive dimensions of the TMBR do not agree
completely with those of the object, the new TMBR will have some dimensions
contracted, resulting in a TMBR with lower dimensionality.
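
The effect of extension and contraction can be sketched for the simplest case, α = 1 with the L1 metric, where the region along the single active dimension is just an interval. The function names are hypothetical, and the sample points are chosen to reproduce the values quoted for Figure 7 (center (3), radius 0 before; center (3, 10), radius 10 after).

```python
def common_prefix_len(points):
    """Number of leading dimensions on which all points agree."""
    first = points[0]
    n = 0
    for i in range(min(len(p) for p in points)):
        if any(p[i] != first[i] for p in points):
            break
        n += 1
    return n


def recompute_tmbr_alpha1(points):
    """Return (center, radius) of the TMBR of `points`, assuming alpha = 1 and L1."""
    k = common_prefix_len(points)          # these dimensions become inactive
    if k == len(points[0]):                # all points identical: nothing left to extend into
        return tuple(points[0]), 0.0
    lo = min(p[k] for p in points)
    hi = max(p[k] for p in points)
    # Dimension k is the single active dimension; its interval is centered at the midpoint.
    return tuple(points[0][:k]) + ((lo + hi) / 2,), (hi - lo) / 2


# Extension: all remaining points agree on the first dimension, so the second becomes active.
print(recompute_tmbr_alpha1([(3, 0), (3, 20), (3, 5)]))          # ((3, 10.0), 10.0)

# Contraction: a new point disagrees on the first dimension, so the center shrinks to length 1.
print(recompute_tmbr_alpha1([(3, 0), (3, 20), (3, 5), (5, 1)]))  # ((4.0,), 1.0)
```

The second call shows why insertion can lower the dimensionality: once the points no longer agree on the first dimension, that dimension must become the active one again.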
Figure 7. Extending a TMBR (diamond), with α = 1. Before extending: Center (3), Radius 0. After extending: Center (3, 10), Radius 10.

5. Experimental Results
We implemented the TV-tree as described above, in C++ under UNIX [1], and we
ran several experiments. The experiments form two sets: In the first, we tried to
determine what is a good value for the number of active dimensions (α) for the
TV-tree; in the second set we compared the proposed method with the R*-tree,
which we believe is the fastest known variation of R-trees.

5.1 Experimental Setup

The test database was a collection of objects of fixed size, using dictionary words
from /usr/dict/words as keys. To find the closest matches in the presence of
typing errors, the queries were exact match and range queries. For features, we
used the letter count for each word, ignoring the case of the letters. Thus, each
word is mapped to a vector v with 27 dimensions, one for each letter of the English
alphabet and an extra one for the non-alphabetic characters. The L1 distance between
two such vectors is a good measure for the edit distance; for this reason, we have
used L1-spheres (diamonds) as our bounding shapes.
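A minimal Python sketch of this feature extraction is given below. The function names are illustrative assumptions, as is the choice to place the count of non-alphabetic characters in the last (27th) position; the text does not specify the ordering of the dimensions.

```python
import string


def letter_counts(word):
    """Map a word to a 27-dimensional letter-count vector, ignoring case.

    Positions 0..25 count 'a'..'z'; position 26 counts every other character
    (this ordering is an assumption made for the example).
    """
    counts = [0] * 27
    for ch in word.lower():
        if ch in string.ascii_lowercase:
            counts[ord(ch) - ord("a")] += 1
        else:
            counts[26] += 1
    return counts


def l1_distance(u, v):
    """L1 (city-block) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# "tree" and "trie" differ by one substituted letter: one 'e' fewer, one 'i' more.
print(l1_distance(letter_counts("tree"), letter_counts("trie")))
```

Two words made of the same letters in a different order map to the same vector, which is why the index can only act as a filter in front of the actual word comparison.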
Finally, we apply the Hadamard Transform² to these letter-count vectors, appropriately
zero-padded. For n = 2^k, the Hadamard Transform matrix is defined as follows:

$$H_1 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad H_{k+1} = \begin{pmatrix} H_k & H_k \\ H_k & -H_k \end{pmatrix}$$

The Hadamard Transform is used to give each letter a more even weight, especially
in the first few dimensions.

1. UNIX is a registered trademark of Novell, Inc.

2. Actually, we are using the 32-dimension Hadamard Transform matrix (Hamming, 1977) and padding
extra 0s to the feature vectors.
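The recursive definition of the Hadamard matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, no normalization factor is applied (none is given in the definition), and the 27-dimensional vectors are padded with zeros to 32 = 2^5 dimensions as the footnote describes.

```python
def hadamard(k):
    """Build the 2^k x 2^k Hadamard matrix H_k from H_1 by the block recursion."""
    h = [[1, 1], [1, -1]]
    for _ in range(k - 1):
        top = [row + row for row in h]                      # [H  H]
        bottom = [row + [-x for x in row] for row in h]     # [H -H]
        h = top + bottom
    return h


def hadamard_transform(vec, k=5):
    """Zero-pad vec to length 2^k and multiply it by H_k (unnormalized)."""
    n = 2 ** k
    if len(vec) > n:
        raise ValueError("vector is longer than the transform size")
    padded = list(vec) + [0] * (n - len(vec))
    return [sum(a * b for a, b in zip(row, padded)) for row in hadamard(k)]


# Small case with k = 2 so the result is easy to check by hand.
print(hadamard_transform([1, 2, 3], k=2))
```

The first output coordinate is the sum of all letter counts (the word length), so every letter contributes to the leading dimensions rather than only the letters that happen to come first in the alphabet.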
The TV-trees in the experiment used the algorithms described in the last section,
with forced re-insertion, and with the ordering method for splitting. We used min_fill
= 45% and set the percentage of elements to be reinserted to p_ri = 30%. These
values are comparable to the optimal R*-tree parameters and were chosen in order
to provide a fair comparison of insertion behavior.
Experiments on 2,000 to 16,000 words were run, with words being randomly
drawn from the dictionary. We varied several parameters, such as the number of
active dimensions α (from 1 to 4), and the tolerance ε of the range query, from ε
= 0 (exact match) up to 2.
For the exact match queries, we tried successful searches (i.e., the query word
was found in the dictionary), using half of the database words as query points.
Experiments with unsuccessful searches gave similar results and are omitted. We
also issued range queries with the words randomly drawn from the dictionary (the
number of queries was half the number of database words).
We measured both the number of disk accesses (assuming that the root is in
core) and the number of leaf accesses. The former measure corresponds to
an environment with limited buffer space; the latter approximates an environment
with enough buffer space that, except for the leaves, the rest of the tree fits in core.

5.2 Results

Analysis for the Number of Active Dimensions. The first set of experiments tried to
determine a good value for α. Different numbers of active dimensions of the
TV-tree were tried. The results are shown in Figures 8 through 10. The page size
was 4K bytes and objects of size 100 bytes were used.
We also measured the total number of pages accessed, assuming that the whole
tree (except the root) was stored on the disk and no buffer for the internal levels
was available. The results are similar.
The results indicate that α = 2 gives the best results, because the TV-2 tree
outperforms the rest. This can be interpreted as an optimization of two conflicting
factors: tree size and number of false drops. With a smaller α, fewer dimensions
will be available to differentiate among the entries, thus more branches will have
to be searched. However, a larger α will lead to a decrease of fanout per node,
making it necessary for more branches to be retrieved when the search space is
large. Moreover, effectively clustering objects in higher dimensions is also more
difficult, given the constraints in shapes allowed. (In 1-D, one can always sort the
numbers and order them, but this method breaks down in higher dimensions.) In the
experiments we ran, α = 2 is the best compromise.

Figure 8. Exact match queries (# of leaf accesses vs. α)
[Plot: number of leaf accesses against the number of active dimensions, one curve per dictionary size (2000, 4000, 6000, 8000, and 10000 words).]

Figure 9. Range queries (tolerance = 1) (# of leaf accesses vs. α)
[Plot: number of leaf accesses against the number of active dimensions, one curve per dictionary size (2000, 4000, 6000, 8000, and 10000 words).]

Figure 10. Range queries (tolerance = 2) (# of disk accesses vs. α)
[Plot: number of disk accesses against the number of active dimensions, one curve per dictionary size (2000, 4000, 6000, 8000, and 10000 words).]

Table 1. Disk accesses per insertion - object size 100 bytes

                   Disk accesses per insertion
Dictionary size    R*-tree      TV-2 tree
4,000              5.25         4.75
8,000              5.51         5.21
12,000             6.19         5.28
16,000             6.50         5.35

5.3 Comparison with R*-Tree

Index Creation. We measured the number of disk accesses (read + write) needed to
build the indexes. We assumed that every update of the index would be reflected
on the disk. We found that, in general, insertion is cheaper in the TV-tree.
This is due to the fact that the TV-tree is usually shallower than the corresponding
R*-tree and, thus, fewer nodes need to be retrieved and fewer potential updates
need to be written back to disk. Table 1 shows the results for object size 100 bytes
with a 4K page size.

Figure 11. Disk/leaf accesses vs. db size - exact match queries
[Plot: number of accesses against dictionary size (2000 to 16000 words), with four curves: R*-tree disk accesses, R*-tree leaf accesses, TV-2 tree disk accesses, and TV-2 tree leaf accesses.]

The big jump between 4,000 and 8,000 for the TV-2 tree is caused by the
introduction of an additional level. However, the TV-2 tree still has one level fewer than
the R*-tree. Thus, the increase in disk accesses for the TV-2 tree is slower after the
additional level has been introduced.
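For orientation, the relative insertion savings implied by Table 1 can be worked out directly. The short Python sketch below only re-expresses the published averages as percentages; the `savings` helper is a hypothetical name and the percentages are derived here, not reported in the experiments.

```python
# Disk accesses per insertion from Table 1: size -> (R*-tree, TV-2 tree).
table1 = {
    4000: (5.25, 4.75),
    8000: (5.51, 5.21),
    12000: (6.19, 5.28),
    16000: (6.50, 5.35),
}


def savings(rstar_cost, tv_cost):
    """Fraction of the R*-tree cost that the TV-2 tree avoids."""
    return (rstar_cost - tv_cost) / rstar_cost


for size, (rstar_cost, tv_cost) in table1.items():
    print(f"{size:>6} words: {100 * savings(rstar_cost, tv_cost):.1f}% fewer disk accesses")
```

On these four data points the saving lies between roughly 5% and 18%, with the smallest value at 8,000 words, just after the TV-2 tree gains its additional level.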

Search. The next set of experiments compared the proposed TV-tree with the
R*-tree. Figures 11 through 13 show the number of disk/leaf accesses as a function
of the database size (number of records). The number of leaf accesses is the lower
curve in each set. A 4K page size was used. The following results are for objects
of size 100 bytes.
As seen from the figures, the TV-2 tree consistently outperforms the R*-tree,
with savings of up to 67-73% in total disk accesses for exact matches and similar
savings in leaf accesses. The savings for range queries are also high (≈ 40% for
large dictionary sizes).
Moreover, the savings increased with the size of the database, indicating that
our proposed method scales up well. As the database size increased from 2,000 to
16,000 elements, the savings in the number of leaf accesses increased consistently:
from 67% to 73% for exact match queries; from 50% to 58% for range queries
with tolerance ε = 1; and from 33% to 42% for range queries with ε = 2.

Figure 12. Disk/leaf accesses vs. db size - range queries (tolerance = 1)
[Plot: number of accesses against dictionary size (2000 to 16000 words), with four curves: R*-tree disk accesses, R*-tree leaf accesses, TV-2 tree disk accesses, and TV-2 tree leaf accesses.]

Figure 13. Disk/leaf accesses vs. db size - range queries (tolerance = 2)
[Plot: number of accesses against dictionary size (2000 to 16000 words), with four curves: R*-tree disk accesses, R*-tree leaf accesses, TV-2 tree disk accesses, and TV-2 tree leaf accesses.]

Figure 14. Comparison of space requirements
[Plot: number of nodes against dictionary size (2000 to 16000 words), one curve for the R*-tree and one for the TV-2 tree.]

Even if we only assume that the leaves are stored on the disk (while all the
non-leaf levels are read into a memory buffer beforehand), the TV-2 tree still outperforms
the R*-tree significantly (around 60-70% for exact match and 25-35% for range
queries with ε = 2).
We also experimented with various sizes of database objects. Our method showed
more significant improvement when the object size is small. As the object size increases,
the leaf fan-out decreases, making the TV-tree grow faster, and offsetting some of
its advantages. However, even with object size 200, we still have an improvement of
around 60% over R*-trees for exact match and 40% for range queries with ε = 2.

Comparison of Space Requirements. Figure 14 shows the number of nodes (= pages)
in the trees. The TV-tree requires fewer nodes (and thus less space).
The savings are 15-20%.
Since the object size is the same for both indexes, the numbers of leaf nodes are
also very similar (in fact, they will be identical when the utilization is the same).
This implies that all the savings in the TV-tree come from internal nodes, which
means that the non-leaf levels require a smaller buffer; this can be significant
when buffer space is limited.

6. Conclusions
In this article, we proposed the TV-tree as a method for indexing high dimensional
objects. The benefit lies in its ability to adapt dynamically and use a variable number
of dimensions to distinguish between objects or groups of objects. Since this number
of required dimensions is usually small, the method saves space and leads to a larger
fan-out. As a result, the tree is more compact and shallower, requiring fewer disk
accesses.
We presented the manipulation algorithms in detail, as well as guidelines for
choosing the design parameters (e.g., optimal active dimension α = 2, minimum
fill factor = 45%). We implemented the method, and we reported performance
experiments, comparing our method to the R*-tree. The TV-tree achieved access
cost savings of up to 80%, at the same time resulting in a reduction in the size
of the tree, and hence its storage cost. Moreover, the savings seem to increase
with the size of the database, indicating that our method will scale well. In short,
we believe that the TV-tree should be the method of choice for high dimensional
indexing.

Acknowledgments
This research was partially funded by the Institute for Systems Research, and by
the National Science Foundation under grants IRI-9205273 and IRI-8958546 (PYI),
with matching funds from EMPRESS Software, Inc. and Thinking Machines, Inc.
The authors thank Alexios Delis and Ibrahim M. Kamel for their help.

References
Agrawal, R., Faloutsos, C., and Swami, A. Efficient similarity search in sequence
databases. FODO Conference, Evanston, IL, 1993.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. A basic local
alignment tool. Journal of Molecular Biology, 215(13):403-410, 1990.
Angell, R.C., Freund, G.E., and Willet, P. Automatic spelling correction using a
trigram similarity measure. Information Processing and Management, 19(4):255-261,
1983.
Arya, M., Cody, W., Faloutsos, C., Richardson, J., and Toga, A. Qbism: A prototype
3-D medical image database system. IEEE Data Engineering Bulletin, 16(1):38-42,
1993.
Aurenhammer, F. Voronoi diagrams: A survey of a fundamental geometric data
structure. ACM Computing Surveys, 23(3):345-405, 1991.
Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. The R*-tree: An
efficient and robust access method for points and rectangles. ACM SIGMOD,
Atlantic City, NJ, 1990.
Bentley, J.L., Weide, B.W., and Yao, A.C. Optimal expected-time algorithms for
closest-point problems. ACM Transactions on Mathematical Software, 6(4):563-580,
1980.

Brinkhoff, T., Kriegel, H.-P., and Seeger, B. Efficient processing of spatial joins
using R-trees. Proceedings of the ACM SIGMOD, Washington, DC, 1993.
Chatfield, C. The Analysis of Time Series: An Introduction. London: Chapman and
Hall, 1984. Third edition.
Friedman, J.H., Baskett, F., and Shustek, L.H. An algorithm for finding nearest
neighbors. IEEE Transactions on Computers, C-24(10):1000-1006, 1975.
Fukunaga, K. Introduction to Statistical Pattern Recognition. New York: Academic
Press, 1990.
Fukunaga, K. and Narendra, P.M. A branch and bound algorithm for computing
k-nearest neighbors. IEEE Transactions on Computers, C-24(7):750-753, 1975.
Greene, D. An implementation and performance analysis of spatial data access
methods. Proceedings of Data Engineering, Boston, MA, 1989.
Guttman, A. R-trees: A dynamic index structure for spatial searching. Proceedings
of the ACM SIGMOD, 1984.
Hamming, R.W. Digital Filters. Englewood Cliffs, NJ: Prentice-Hall, 1977.
Hartigan, J.A. Clustering algorithms. New York: John Wiley & Sons, 1975.
Hoel, E.G. and Samet, H. A qualitative comparison study of data structures for
large line segment databases. Proceedings of the ACM SIGMOD Conference, San
Diego, CA, 1992.
Hunter, G.M. and Steiglitz, K. Operations on images using quad trees. IEEE
Transactions on PAMI, 1(2):145-153 (1979).
Jagadish, H.V. Spatial search with polyhedra. Proceedings of the Sixth IEEE Interna-
tional Conference on Data Engineering, Los Angeles, CA, 1990.
Jagadish, H.V. A retrieval technique for similar shapes. Proceedings of the ACM
SIGMOD Conference, Denver, CO, 1991.
Kamel, I. and Faloutsos, C. Hilbert R-tree: An improved R-tree using fractals.
Systems Research Center (SRC) TR-93-19, University of Maryland, College
Park, MD, 1993.
Kukich, K. Techniques for automatically correcting words in text. ACM Computing
Surveys, 24(4):377-440, 1992.
Mandelbrot, B. Fractal Geometry of Nature. New York: W.H. Freeman, 1977.
Murtagh, F. A survey of recent advances in hierarchical clustering algorithms. The
Computer Journal, 26(4):354-359, 1983.
Narasimhalu, A.D. and Christodoulakis, S. Multimedia information systems: The
unfolding of a reality. IEEE Computer, 24(10):6-8, 1991.
Niblack, W., Barber, R., Equitz, W., Flickner, M., Glasman, E., Petkovic, D., Yanker,
P., Faloutsos, C., and Taubin, G. The qbic project: Querying images by content
using color, texture, and shape. SPIE 1993 International Symposium on Electronic
Imaging: Science and Technology Conference 1908, Storage and Retrieval for Image
and Video Databases, San Jose, CA, 1993. Also available as IBM Research Report
RJ 9203 (81511), 1993.
Nievergelt, J., Hinterberger, H., and Sevcik, K.C. The grid file: An adaptable,
symmetric, multikey file structure. ACM TODS, 9(1):38-71, 1984.

Orenstein, J.A. and Manola, F.A. Probe spatial data modeling and query processing
in an image database application. IEEE Transactions on Software Engineering,
14(5):611-629, 1988.
Ruskai, M.B., Beylkin, G., Coifman, R., Daubechies, I., Mallat, S., Meyer, Y.,
and Raphael, L. Wavelets and Their Applications. Boston: Jones and Bartlett
Publishers, 1992.
Salton, G. and Wong, A. Generation and search of clustered files. ACM TODS,
3(4):321-346, 1978.
Samet, H. The Design and Analysis of Spatial Data Structures. Reading, MA:
Addison-Wesley, 1989.
Schroeder, M. Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise. New
York: W.H. Freeman and Company, 1991.
Wallace, G.K. The jpeg still picture compression standard. CACM, 34(4):31-44,
1991.

Appendix

A. Calculation of the Telescopic Minimum Bounding Diamond (TMBD)

To find the TMBD of a given set of points or diamonds, we first find the largest
m such that all the TVECTORs (centers of the diamonds or vectors corresponding to
data points) agree in the first m dimensions. Then we project onto the next α dimensions,
where α is the number of active dimensions of the TV-tree. Thus, the projected
diamonds will reside in an α-dimensional space. An example is given in Table 2,
assuming the diamonds are from a TV-2 tree.
In Table 2, m is 2 (and α is 2 by definition of the TV-2 tree). Note that the
projected second diamond has a radius of 0 because the third and fourth dimensions
are not active dimensions of that diamond. This means that all points inside the
diamond will have coordinates that start with (1,0,8,7,...).
From there we find the minimum bounding diamond of the projected diamonds,
and use its center as the active dimensions of the final MBD. The non-active
dimensions will be the common m dimensions we first found. Finding the minimum
bounding diamond of these projected diamonds can be formulated as a linear
programming problem. However, we decided to use a faster approximation
algorithm to find the approximate MBD. The algorithm first calculates the bounding
(hyper)rectangle of the projected diamonds, and then uses its center as the
diamond center. The smallest radius that is needed to cover all the diamonds is then
calculated.

Table 2. Example of Diamond Projection in a TV-2 tree

Original diamond             Projected diamond
Center           Radius      Center    Radius
(1,0,3,4)        2           (3,4)     2
(1,0,8,7,5,6)    4           (8,7)     0
(1,0,2,6)        1.5         (2,6)     1.5

Algorithm 4. Finding the MBD

begin
/* α is the number of active dimensions */
Proc TMBD(Array of Diamonds D, integer α);
1. Find min, the minimum dimensionality among all diamonds in D.
2. Find the maximum m such that all the diamonds have the same first m
dimensions.
3. If m + α < min
Set Startproject ← m + 1
else
Set Startproject ← min - α + 1 /* special case when some diamonds have
small dimensionality. This step is to ensure that there will be α active
dimensions */
4. Project each diamond to dimensions Startproject ... Startproject + α - 1,
setting the radius to 0 if none of the projected dimensions is active,
otherwise retain the original radius.
5. Find the minimum bounding rectangle of the projected diamonds. Let c be its
center.
6. Set center of the result diamond ← the m common dimensions of the
diamonds concatenated with c.
7. Find the minimum distance that is needed to contain all diamonds, and set
this as the radius.
end
Continuing the example from Table 2, the bounding rectangle for the projected
diamonds has boundary (0.5, 8) along the first dimension, and (2, 7.5) along the
second. Its center is (4.25, 4.75). The radius required to cover all three diamonds
is 6. Thus, the final TMBR has center (1, 0, 4.25, 4.75) and radius 6.
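The approximation procedure of Algorithm 4 can be sketched in Python as follows. This is an illustrative reading, not the original C++ code: diamonds are assumed to be (center, radius) pairs, the active dimensions of a diamond are assumed to be the last α coordinates of its center, every center is assumed to have at least α coordinates, and in the special case of step 3 the result keeps the dimensions that precede the projection window.

```python
def tmbd(diamonds, alpha):
    """Approximate telescopic minimum bounding diamond of (center, radius) pairs."""
    centers = [list(c) for c, _ in diamonds]
    min_dim = min(len(c) for c in centers)                       # step 1
    m = 0                                                        # step 2
    while m < min_dim and all(c[m] == centers[0][m] for c in centers):
        m += 1
    # Step 3, using a 0-based start index for the projection window.
    start = m if m + alpha < min_dim else max(0, min_dim - alpha)
    # Step 4: project; keep the radius only if the window touches an active dimension.
    projected = []
    for center, radius in diamonds:
        first_active = len(center) - alpha
        touches_active = start + alpha > first_active
        projected.append((list(center[start:start + alpha]),
                          radius if touches_active else 0))
    # Step 5: bounding rectangle of the projected diamonds and its center c.
    lo = [min(p[j] - r for p, r in projected) for j in range(alpha)]
    hi = [max(p[j] + r for p, r in projected) for j in range(alpha)]
    c = [(a + b) / 2 for a, b in zip(lo, hi)]
    # Step 6: common leading dimensions followed by c.
    result_center = centers[0][:start] + c
    # Step 7: smallest L1 radius around c that contains every projected diamond.
    result_radius = max(sum(abs(x - y) for x, y in zip(c, p)) + r
                        for p, r in projected)
    return result_center, result_radius


table2 = [((1, 0, 3, 4), 2), ((1, 0, 8, 7, 5, 6), 4), ((1, 0, 2, 6), 1.5)]
print(tmbd(table2, alpha=2))
```

Run on the three diamonds of Table 2, the sketch yields center (1, 0, 4.25, 4.75) and radius 6, matching the worked example.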

B. Telescoping Without Truncation

Given a feature vector of length n, its contraction to length m is achieved through
multiplication by the matrix A_m. Here we present an example of a simple,
summation-based, telescoping function that does not involve truncation. The required
series of matrices A_m are:

If n ≤ 2m, A_m has a 1 in positions (1,1), (2,2), ..., (2m-n, 2m-n), (2m-n+1, 2m-n+1),
(2m-n+2, 2m-n+1), (2m-n+3, 2m-n+2), (2m-n+4, 2m-n+2), ..., (n, m), and a 0
everywhere else.

In other words, the first 2m - n rows have a single 1 each on the diagonal, and
the remaining n - m rows have two 1s each, in pairs, in a stretched out continuation
of the diagonal. Call this the halving step.
If n/4 ≤ m < n/2, obtain the matrix A_p, where p = ceiling(n/2), using the halving
step, and then apply the halving step once more to the p-length vector to create
an m-length vector. A_m is obtained as the product of the two matrices for each
application of the halving step.
Similarly, for any value of m, enough applications of the halving step produce
the required contraction. The contraction for m = 1 is simply the summation of
all elements, induced by a matrix A_m, which is a vector of all 1's.
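The halving step can also be written directly on vectors, without forming the matrices. The Python sketch below is one possible reading under stated assumptions: the first 2m - n components are copied unchanged and the remaining components are summed in consecutive pairs, and the function names `halving_step` and `contract` are hypothetical.

```python
def halving_step(vec, m):
    """Contract vec (length n, with m <= n <= 2m) to length m."""
    n = len(vec)
    if not m <= n <= 2 * m:
        raise ValueError("halving step requires m <= n <= 2m")
    keep = 2 * m - n                      # leading components copied as they are
    out = list(vec[:keep])
    for i in range(keep, n, 2):           # the remaining 2(n - m) components, in pairs
        out.append(vec[i] + vec[i + 1])
    return out


def contract(vec, m):
    """Contract vec to length m by repeated halving steps."""
    vec = list(vec)
    while len(vec) > 2 * m:
        vec = halving_step(vec, (len(vec) + 1) // 2)   # p = ceiling(n / 2)
    return halving_step(vec, m)


print(contract([1, 2, 3, 4, 5], 3))   # one halving step
print(contract([1, 2, 3, 4, 5], 1))   # repeated halving down to a single sum
```

Because every step only adds components together, the sum of all components is preserved at every length, and the contraction to m = 1 is that sum.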

67% (3)
Attention Is All You Need
11 pages
Python Programming and Maching Learning 2 in 1 B08Y5DPX32
100% (7)
Python Programming and Maching Learning 2 in 1 B08Y5DPX32
145 pages
Current and Future Trends on AI Applications - Mohammed A Al-Sharafi
No ratings yet
Current and Future Trends on AI Applications - Mohammed A Al-Sharafi
456 pages
Psych Unit 7a Practice Quiz
No ratings yet
Psych Unit 7a Practice Quiz
4 pages
Python Tutorial
71% (7)
Python Tutorial
42 pages
Lab 1 Introduction To My SQL
No ratings yet
Lab 1 Introduction To My SQL
8 pages
How To SELECT in SAP BW Transformations PDF
No ratings yet
How To SELECT in SAP BW Transformations PDF
11 pages
CL Gui Alv
No ratings yet
CL Gui Alv
35 pages
Payroll Full
No ratings yet
Payroll Full
117 pages
MD070 SSN Shipping and RMA Receipt Customization V1 1
No ratings yet
MD070 SSN Shipping and RMA Receipt Customization V1 1
42 pages
Multiple Choice Questions of Microsoft Access
86% (7)
Multiple Choice Questions of Microsoft Access
6 pages
CASE STUDY
No ratings yet
CASE STUDY
6 pages
MCQ'S of DB
No ratings yet
MCQ'S of DB
11 pages
Optim Exit Routine
No ratings yet
Optim Exit Routine
524 pages
Chapter 5-Record Storage and Primary File Organization
100% (1)
Chapter 5-Record Storage and Primary File Organization
64 pages
Firebird Recovery, Optimization, and Technical Support
No ratings yet
Firebird Recovery, Optimization, and Technical Support
9 pages
EhLib Users Guide 2.0
No ratings yet
EhLib Users Guide 2.0
141 pages
ST12-ABAP Trace Using The Single Transaction Analysis
No ratings yet
ST12-ABAP Trace Using The Single Transaction Analysis
10 pages
Automation Anywhere Version A2019.10 Enterprise On-Premises
No ratings yet
Automation Anywhere Version A2019.10 Enterprise On-Premises
554 pages
Osm2pgsql Performance
No ratings yet
Osm2pgsql Performance
34 pages
Ceng301 Dbms Session 12
No ratings yet
Ceng301 Dbms Session 12
31 pages
Access Notes ICMS
No ratings yet
Access Notes ICMS
50 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

The W-Tree: An Index Structure For High-Dimensional Data: King-Lp Lin, H.V. Jagadish, and Christos Faloutsos

Uploaded by

The W-Tree: An Index Structure For High-Dimensional Data: King-Lp Lin, H.V. Jagadish, and Christos Faloutsos

Uploaded by

VLDB Journal, 3, 517-542 (1994), Ralf Hartmut Güting, Editor

The TV-Tree: An Index Structure for High-Dimensional Data

King-Ip Lin, H.V. Jagadish, and Christos Faloutsos

Received July 12, 1993; accepted May 20, 1994.

Abstract. We propose a file structure to index high-dimensionality data, which are

Key Words. Spatial index, similarity retrieval, query by content.

Many applications require enhanced indexing that is capable of performing similarity

For all of these applications, we rely on an expert to derive features that

Previous work compared the performance of different spatial data structures.

3. Intuition Behind the Proposed Method

3.1 Telescoping Function

In general, the telescoping problem can be described as follows. Given an n x 1

Figure 1. Illustration of the Karhunen-Loève transform

Transform (Ruskai et al., 1992). Fortunately, many data-independent transforms will

3.2 Shape of Bounding Region

$$L_p(\vec{x}, \vec{y}) = \left[ \sum_i |x_i - y_i|^p \right]^{1/p} \qquad (1)$$
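
As a rough illustration of Equation 1, the following Python sketch computes the L_p distance between two feature vectors. The function name `lp_distance` and its argument names are assumptions made for this example and do not come from the paper. With p = 1 the set of points within a fixed distance of a center is a diamond, and with p = 2 it is a sphere, which matches the two bounding shapes named in the figure captions.

```python
def lp_distance(x, y, p):
    """L_p distance between two equal-length feature vectors (Equation 1).

    Illustrative helper only; the name and signature are assumptions.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    # Sum the p-th powers of the coordinate-wise absolute differences,
    # then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# p = 1 gives the "diamond" (Manhattan) distance,
# p = 2 gives the "sphere" (Euclidean) distance.
d1 = lp_distance([0, 0], [3, 4], 1)
d2 = lp_distance([0, 0], [3, 4], 2)
```

Here `d1` is the sum of the coordinate differences and `d2` is the usual straight-line distance between the same two points.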

4.1 Node Structure

Figure 2. Example of TMBRs (diamonds, spheres) with different α

[Figure 2 panels: number of active dimensions = 1; number of active dimensions = 2; number of active dimensions = 2. Legend: the marked symbol denotes that the region extends indefinitely along that direction.]

struct TMBR { TVECTOR v;

where TVECTOR stands for telescopic vector.
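
To make the node entry more concrete, here is a minimal Python sketch of a telescopic bounding region and a containment test. Only the field `v` of type TVECTOR appears in the struct fragment above. Everything else is an assumption made for illustration: the `radius` field, the split of the telescopic vector into `inactive` and `active` coordinate lists, and the `contains` method. The sketch treats the region as unbounded in every dimension past the active ones, in line with the "extends indefinitely" legend of Figure 2.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TVector:
    """Telescopic vector (assumed layout, not taken from the paper)."""
    inactive: List[float]  # leading coordinates shared by all enclosed points
    active: List[float]    # coordinates over which the region has extent


@dataclass
class TMBR:
    """Telescopic minimum bounding region: a center plus an assumed radius."""
    v: TVector     # center, named v as in the struct fragment
    radius: float  # assumed field

    def contains(self, point: Sequence[float], p: int = 1) -> bool:
        """Return True if point lies inside the region under the L_p metric."""
        k = len(self.v.inactive)
        a = len(self.v.active)
        if len(point) < k + a:
            raise ValueError("point has too few coordinates")
        # Inactive dimensions must match the center exactly.
        if list(point[:k]) != self.v.inactive:
            return False
        # Active dimensions must fall within the L_p ball of the given radius.
        dist = sum(abs(x - c) ** p
                   for x, c in zip(point[k:k + a], self.v.active)) ** (1.0 / p)
        # Coordinates beyond k + a are ignored: the region is unbounded there.
        return dist <= self.radius


region = TMBR(TVector(inactive=[1.0], active=[0.0, 0.0]), radius=2.0)
inside_diamond = region.contains([1.0, 1.0, 1.0, 99.0], p=1)
```

With p = 1 the region behaves like a diamond and with p = 2 like a sphere. The exact float comparison on the inactive coordinates is a simplification of this sketch.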

Figure 3. Example of a TV-1 tree (with diamonds)

4.2 Tree Structure

Figure 4. Example of a TV-2 tree (with spheres)

Figure 5. Decrease in dimensionality during insertion

accommodate the new object by contracting its center as little as possible.

Handling overflowing nodes is another important aspect of the insertion algo-

Figure 6. Illustration of choose-branch criteria

Algorithm 1. Insert algorithm.

Algorithm 3. Splitting by ordering

Deletion. Deletion is straightforward, unless it causes an underflow. In this case,

Extending and Contracting. As previously mentioned, extending and contracting of

Figure 7. Extending a TMBR (diamond), with α = 1

[Figure 7 panels: Before Extending; After Extending]

5.1 Experimental Setup

1. UNIX is a registered trademark of Novell, Inc.

on these letter-count vectors, appropriately zero-padded. The Hadamard Transform

Figure 8. Exact match queries (# of leaf accesses vs. α)

Figure 9. Range queries (tolerance=1) (# of leaf accesses vs. α)

Figure 10. Range queries (tolerance=2) (# of disk accesses vs. α)

Table 1. Disk access per insertion - object size 100 bytes

Dictionary size Disk access per insertion

5.3 Comparison with R*-Tree

Figure 11. Disk/leaf accesses vs. db size - exact match queries

Figure 12. Disk/leaf accesses vs. db size - range queries (tolerance=1)

Figure 13. Disk/leaf accesses vs. db size - range queries (tolerance=2)

Figure 14. Comparison of space requirements

Comparison of Space Requirements. Figure 14 shows the number of nodes (= pages)

A. Calculation of the Telescopic Minimum Bounding Diamond (TMBD)

Table 2. Example of Diamond Projection in a TV-2 tree

Original diamond Projected diamond

Algorithm 4. Finding the MBD

B. Telescoping Without Truncation

Given a feature vector of length n, its contraction to length m is achieved through

If n < 2m, A_m has a 1 in position (1,1), (2,2), ..., (2m-n, 2m-n), (2m-n+1,
