0% found this document useful (0 votes)
48 views

CBIR Papers For PHD

The document summarizes a research paper on using a cluster system for content-based multimedia retrieval. It proposes a data partitioning scheme that uses system characteristics like processor speed and storage to distribute data among heterogeneous processors. Experiments show this heterogeneous model significantly reduces retrieval time from a 30,000 image database compared to a previous homogeneous parallel retrieval model.

Uploaded by

kalpesh_chandak
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
48 views

CBIR Papers For PHD

The document summarizes a research paper on using a cluster system for content-based multimedia retrieval. It proposes a data partitioning scheme that uses system characteristics like processor speed and storage to distribute data among heterogeneous processors. Experiments show this heterogeneous model significantly reduces retrieval time from a 30,000 image database compared to a previous homogeneous parallel retrieval model.

Uploaded by

kalpesh_chandak
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

International Conference on High Performance Computing and Networking Europe (HPCN Europe-99),

Amsterdam, The Netherlands, April 12 - 14, 1999


For other papers of the PDC group at GWU, go to http://www.seas.gwu.edu/seas/eecs/Research/Parallel-Distributed

Content-based Multimedia Data Retrieval on Cluster


System Environment

Sanan Srakaew, Nikitas Alexandridis, Punpiti Piamsa-nga, George Blankenship

Department of Electrical Engineering and Computer Science


The George Washington University
nd
801 22 St., N.W., Washington, DC 20052
{srakaew, alexan, punpiti, blankeng}@seas.gwu.edu

Abstract. In this paper, we propose a data partitioning scheme for content-


based retrieval from a multimedia database using a heterogeneous cluster
system. The proposed parallel unified model [6] was used to represent
multimedia data. All types of multimedia data in the unified model are
represented by k-dimensional(k-d) signals. Each dimension of k-d data is
separated into small blocks and then formed into a hierarchical
multidimensional tree structure, called a k-tree. The parallel version of k-tree
model was introduced in [7]. The previous experimental results show the huge
reduction of retrieval time on a cluster of homogeneous workstations. In this
paper, we extend our parallel model to a heterogeneous cluster system
environment. We demonstrate the experimental results of using a parallel
retrieval algorithm for the k-tree unified model on a cluster of heterogeneous
system connected via a network. We use system characteristics to help in
partitioning the data and balancing the loads of the processors. The experiments
of the model with load balancing shows a significant reduction of retrieval time
while maintaining the quality of perceptual results.

1 Introduction

Multimedia databases have become more important since conventional databases


cannot provide the necessary efficiency and performance. Multimedia databases
encounter three major difficulties. First, the content is subjective information; that is,
intelligence is required to characterize the data. The data recognition requires prior
knowledge and special techniques in Computer Vision and Pattern Recognition; the
problems are NP-hard. Second, if a method or processing technique is designed and
developed for one type of data or feature, it usually not appropriate for others. For
instance, a technique designed for indexing audio data may not be usable for image
data; or, a technique developed for a color feature may not be useful for a texture
feature in image and video data. Third, the usual huge size of multimedia data and the
requirement for a similarity search affect the computation. A similarity search is
desirable for a multimedia database system. If a picture of a house is used as a query
to an image database, we expect to retrieve pictures that contain similar houses in
them. The comparison is not pixel by pixel between a query and the records in a
database; but rather, closeness to the query. Similarity matching requires the
computation of the distance between a query and each record in the database; the best
match is chosen from the data set with the smallest distances. To solve these three
problems, we use a mathematical model to represent the features; a k-tree model to
represent the data structures of the multimedia data; and exploit parallelism to reduce
the retrieval time [7].
In this paper, color and texture are the features of interest; they represent the
subjective information of the multimedia data. We use a normalization technique to
generate the indices. The domain of a feature is reduced to a set of selected values
from a universe of potential values for the feature. We use an identification number
for each element in the reduced set [7]. When data is inserted into the system, it is
converted to the selected domain. The feature is represented by a histogram. For color
feature, a few colors are picked from the whole infinite universe of colors. A finite
number indexes each color. The color feature of an image or a video is represented by
a histogram using the indexed color. For texture feature, we selected a set of textures
and assigned an identification number to each texture. The feature of a texture is
represented by the histogram of texture identification, which is the same method that
was used for the color feature. The comparison of two features is based upon the
distance between the histograms that define the features.
To reduce the response time, one may use a parallel model of a homogeneous system
to perform a content-based multimedia retrieval. The experimental results were very
positive in both qualitative and quantitative metrics. However, in practical, we do not
have dedicated machines that always have the same configurations. The homogeneous
model may be not used efficiently enough in the real-life heterogeneous environment.
In this paper, we investigate a data partitioning scheme for multimedia database
retrieval on a heterogeneous cluster system. We use system characteristics, such as
processor speed and available storage, to partition data among the processors in the
systems. Our computer system environment is composed of Sun Sparc and Pentium-
Linux machines, which are connected via a 10Mbit local area network. We evaluate
the model by comparing the retrieval results with the previous homogeneous parallel
retrieval. The experimental results show the heterogeneous model produces a
significant reduction of the retrieval time of an image from a 30,000-record image
database.
This paper is organized as follow. Section 2 describes our k-tree parallel model.
Section 3 has the details of our cluster system environment and its heterogeneity. The
experiment and its results are described in Section 4 The last section concludes our
works and proposes future directions.

2 The K-Tree Parallel Model


k
A k-tree is a directed graph; each node has 2 incoming edges and one
outgoing edge with a balanced structure [6]. A k-tree is a binary tree for 1-
dimensional data and a quadtree for 2-dimensional data. Exploiting a k-tree brings
three main benefits. First, the k-tree holds the information of spatio-temporal data on
the tree structure itself. It reduces distance computation time to a comparison between
two tree nodes. Second, a k-tree can accelerate multiresolution processing by
calculating small, global information first and then large, local information when
precise resolution is needed. Third, the data on a k-tree is unified since only the
degree of the tree changes, while the processing algorithm and data structure remain
invariant. Therefore, an algorithm for a particular type of feature can be reused for a
feature of another media type.
Content-based retrieval of multidimensional signals is done by comparing features
extracted from the input query with features extracted from every record in the
database. The features of a multidimensional signal are subjective information. They
are characteristics that are used to distinguish one signal from others. A 2-dimensional
signal, such as an image, is characterized by features such as color, texture, and
intensity. The basic algorithms for the searching of data in each of the different
domains are quite similar. A matching search requires that the index key (defining
feature) be unique and matched to the query. Exactly matched searching requires
exhaustive comparisons that are inefficient and unsuitable for multidimensional
signals; similarity searching is more appropriate. A similarity-search re-orders the
database by distance between each record and the query; the result is selected from
the ranking.
Multimedia data retrieval requires similarity searching; exact matching, which is used
in conventional database, is not appropriate for this type of application. Similarity
searching generally can be done in two steps; 1) finding distances between a query
and all records in the database and then 2) sorting the distances and returning the
results – the set of data items that have shortest distances. We also call this process
“ranking.” The details of regular weighted ranking are discussed in [7]. In Figure 1,
we show a parallel searching using multi-feature scheme. Prior to the search the
database is distributed among processors. Each processor performs the comparison
between the query and its portion of the database. The search results based on those
features are sorted in parallel to create the final ranking.

Temporal
Spatial

T1 T2

List of w1
F1 Distances
Set1 Set1

List of w2
F2 Distances
Extracted Set2 SetK
Features Ranking
of Query Results

List of wK
FK Distances
SetK SetK

D
Parallel Sorting
and Merging
Distance Computation Database: D = D1+D2+…+DN
Processors: Set1, Set2,…, SetK
Set1: P1, P2,…, Pm
Set2: Pm+1, Pm+2,…, P2m
SetK: PN-m, PN-m+1,…, PN

Feature: F1, F2,…, FK

Figure 1. Parallel Model Multi-feature Searching


3 Cluster System Environment

In this section, we give the details of our experimental platforms, evaluated


multimedia database systems, and our proposed load-balancing algorithm for the k-
tree model.

3.1 Computational platform

A cluster system environment plays an important role in a distributed multimedia


database. A simple model of a distributed environment is a homogeneous system,
where each workstation is identical and communication links between the
workstations are also comparable. In such a system, data partitioning can be evenly
subdivided. A static load-balancing scheme can be applied. A system environment is
usually heterogeneous; the workstations are different in terms of computational
power, storage space, hardware architecture, and so on. In this paper, we establish a
uniform framework for content-based multimedia data retrieval on a cluster system
environment. In our experiment, one of the workstation serves as a host while others
are assigned to be slave nodes. The system is composed of seven machines, which
include two models of Sun workstations and Pentium/Linux PCs connected via local
area network and use the Message Passing Interface (MPI) library as a message-
passing interconnection mechanism [8].

3.2 Multimedia database system

In this paper, we use an image database which uses histogram-based features as


indices. Two types of histogram-based feature (colors and textures) have been
examined. Before beginning the extraction of features, all images are normalized,
scaled down to 128x128 pixels. The color feature extraction is performed in two
steps. The first step transforms the number of colors of the scaled images to a pre-
selected 166-color set [5]. The second step stores the transformed image in a quad tree
structure. Texture feature extraction requires three steps. The first step transforms the
64 blocks of 16x16 pixels in to 64 sets of wavelet data using a Quadratic Mirror Filter
(QMF) (2 iterations, 7 sub-bands) [7]. Each wavelet data produces seven subbands of
means and variances; i.e. a 14-element vector. In the second step, the texture vectors
are then compared to 162-reference textures from VisTex [7] in the known texture
table to generate 64 texture indices representing textures for blocked data. The third
step constructs and stores the texture features in a quad tree structure.
The steps of the quad tree generation are the same for both features. The transformed
color images and texture-identification (texture-id) matrices are mapped onto the
leaves of a quad tree structure. The leaves represent a single pixel of the normalized
image. Histograms of the leaves, which share the same parent nodes, are summed and
the results are stored at their parent nodes. The process continues iteratively for each
level until the root has been reached.
3.3 Data Partitioning and Load Balancing

To exploit heterogeneous parallelism, we use task and machine characteristics to


decide which processors the tasks should be allocated to. The heterogeneity in the task
level is the differences of the searching into the feature indices and the heterogeneity
of the machines includes processor power and communication speed. The general idea
is to give more data to a machine whose estimated execution time is smallest. Data
given to a particular machine can be locally stored or remotely accessed via a faster
network link. For example, suppose a system consists of two machines; one is twice
faster than another. Thus, the faster workstation should have two-third of database and
the slower one should have the one-third portion. However, if the faster machine
doesn't have enough disk space to store all two-third portion, parts of database needed
on this workstation can be remotely kept on another one. The size of remotely kept
data is dependent of average link latency between these two machines. We can
generalize the load distribution in a mathematical model as follows.
Let S be a set of heterogeneous machines; S = {M1, M2, …, Mm}; m be the number of
machines; Mi(Ci,Si,Li) be a machine described by computational power Ci, an
available storage Si, and estimated network bandwidths Li; Li ={λij, ∀ i ≠ j}. λij is an
average network latency between Mi and Mj for all i ≠ j.
Let B a data base size and EI be an estimated execution time of task T on Mi
Ei = Ci + Ri
where Ci is an estimated computational time and
Ri is an estimated IO time
Let Sij be a portion of B on Mj that Ei needs to access, therefore
Ei = Ci∗Σ[pij] + (Ri/λii)∗Σ [pij*λij]
Our goal is to minimize Ei subject to the following constraints:
Σ[pij] = 1
Σ[pij] ≤ Si/B, 1≤j≤m
Ei = Ej , ∀i≠j

4 Experiments and Results

The cluster environment consists of seven Unix-based machines; two Pentium/Linux-


based PCs, two Sun Sparc-20 workstations, and three Sun Ultrasparc I workstations.
One of Sun Ultrasparc is designated as the host. The MPI library is used as the
interconnection mechanism. We use comparative computational power to classify
workstation types. The computational power ratios are based on the results similarity
searching a database of 500 images using single- and multi- feature storage
algorithms. The computational power ratio of a Sun Ultrasparc I to a Pentium II PC to
a Sun Sparc-20 is 3:2:1.8.
In this experiment, image retrieval is performed using two features; color and texture.
The extracted features are derived from database images of 128x128 pixels; each is
evenly divided into 64 blocks. The quadtree of the histograms for each image is made
up of 3 levels; 64 leaf nodes. We perform two data partitioning schemes based on the
database of 30,000 images. In the first scheme, the database of color and texture
histograms is evenly divided among workstations, without considering the
heterogeneity of cluster environment. In the second approach, data partitioning is
based on the ratio of computational power of the workstations and available storage.
The results shown in Figure 2(a) depicts the response time of the system as a function
of the number of processors used to perform the ranking; the selection features are
both color and texture. Figure 2(b) shows speedup as a function of the number of
processors. As the number of processors used to perform the computation increases,
the computation time decreases significantly. Moreover, the data partitioning based on
heterogeneity information achieves a higher speedup than an even distribution
approach. Figure 3 depicts the top-twenty output images on the sorted list when both
color and texture are used as selection features.

1000 7

900 Uniform Distribution Uniform Distribution


Heterogeneity based 6
Heterogeneity Based
800
Response Time (seconds)

700
5

600

Speedup
500 4

400

3
300

200
2

100

0 1
1 2 3 4 5 6 7 1 2 3 4 5 6 7

Number of Processors Number of Processors

Figure 2. (a). Response time, (b) Speedup

Input Input
query query

Output Output

Figure 3. Perception output; the selection features are color and texture
5. Conclusions

We introduced a parallel model for multimedia database content-based retrievals on a


cluster of heterogeneous workstations. The model allows the extension of the system
for the new types of data, new techniques, and new types of interest contents with less
effort. The experimental results show that heterogeneous processing with load
balancing can reduce retrieval time over a homogeneous approach. Data partitioning
based on system heterogeneity achieves a better response time in comparison with
uniform distribution of database over a cluster of workstations. Our future work will
focus mainly on classification technique based on multi-resolution structure of the k-
tree at different levels. Some load balancing and process migration techniques are also
in our future work.

References

1. P. Chalermwat, N. Alexandridis, P. Piamsa-nga, and M. O'Connell, Parallel image


processing on heterogeneous computing network systems, International Conference on
Image Processing, 1996.
2. T. El-Ghazawi, P. Chalermwat, P. Piamsa-nga, A. Ozkaya, N. Speciale, and D. Wilson,
PACET: PC-parallel architecture for cost-efficient telemetry processing, IEEE Aerospace
Conference, 1998.
3. V. Gudivada and V. Raghavan, “Special issue on content-based image retrieval systems,” in
IEEE Computers, Vol. 28, No. 9, September 1995.
4. Z. Kemp, “Multimedia and spatial information systems,” IEEE Multimedia, 2(4), 1995.
5. J. R. Smith and S.-F. Chang, SaFe: “A General Framework for Integrated Spatial and Feature
Image Search,” IEEE Workshop on Multimedia Signal Processing, 1997.
6. P. Piamsa-nga, N. Alexandridis, G. Blankenship, G. Papakonstantinou, P. Tsanakas, and S.
Tzafestas, “A Unified Model for Multimedia Retrieval by Content,” International
Conference on Computer and Their Application (CATA98), 1998.
7. P. Piamsa-nga, N. Alexandridis, S. Srakaew, and G. Blankenship, “A parallel algorithm for
multi-feature content-based multimedia retrieval,” Seventh International Conference on
Intelligent Systems (ICIS98), Paris, France, July 1-3, 1998.
8. S. Srakaew, N. A. Alexandridis, P. Piamsa-nga, and G. Blankenship, “A parallel model for
multimedia retrieval based on multidimensional signal structure,” in International workshop
on systems, signal and image processing (IWSSIP98), Zagreb, Croatia, June 3-5, 1998.

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy