1: Introduction to Hashing:
The search time of a search algorithm depends on the number n of elements in the
data collection S.
The linear search (sometimes called the sequential search) is the simplest algorithm
to search for a specific target key in a data collection, for example to search for a
specific integer value in an array of integers. It is also the least efficient. It simply
examines each element in turn, starting with the first element, until it finds the target
element or it reaches the end of the array. A linear search does not require the array
to be sorted.
The binary search is the standard method for searching through a sorted array. It is
much more efficient than the linear search, but it does require that the elements be
in order.
The time taken for a search using each of these methods depends on the size of the
collection. The time for a linear search is proportional to the size of the collection: it
takes 10 times as long on average to find an element in an array of 100 elements as
it does in an array of 10 elements. The binary search time is proportional to the
logarithm of the collection size: it takes about twice as long on average to find an
element in an array of 100 elements as it does in an array of 10 elements, since
log 100 = 2 log 10.
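The growth rates described above can be checked by counting comparisons. The following Python sketch is only an illustration; the function names (linear_search, binary_search, worst_case_comparisons) are chosen here and do not come from the original notes. It counts worst-case comparisons rather than the averages quoted above, so the ratios are approximate.

```python
def linear_search(items, target):
    """Scan from the first element; return (index, comparisons), index -1 if absent."""
    comparisons = 0
    for index, value in enumerate(items):
        comparisons += 1
        if value == target:
            return index, comparisons
    return -1, comparisons


def binary_search(sorted_items, target):
    """Halve the search range each step; the input must be sorted."""
    low, high = 0, len(sorted_items) - 1
    comparisons = 0
    while low <= high:
        mid = (low + high) // 2
        comparisons += 1
        if sorted_items[mid] == target:
            return mid, comparisons
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, comparisons


def worst_case_comparisons(n):
    """Largest number of comparisons needed to find any element of 0..n-1."""
    data = list(range(n))
    linear_worst = max(linear_search(data, x)[1] for x in data)
    binary_worst = max(binary_search(data, x)[1] for x in data)
    return linear_worst, binary_worst


print(worst_case_comparisons(10))   # (10, 4)
print(worst_case_comparisons(100))  # (100, 7)
```

Going from 10 to 100 elements multiplies the linear search cost by 10, while the binary search cost grows only from 4 to 7 comparisons.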
A further search method is hashing. Hash data structures allow the storage and
retrieval of data in an average time which does not depend at all on the collection
size.
Hashing is thus a method of storing data so that searching, inserting and deleting
are fast (in theory, O(1)). This ideal may not be fully attainable in practice, but we
can achieve a performance very close to it using a data structure known as a
hash table.
www.magix.in
(Fig M10.1: Implementing a hash table T[0...m-1], where the elements are stored in the table itself)
The above figure shows the implementation of a hash table T[0...m-1], where the
elements are stored in the table itself. Here each key k is mapped to a hash table slot
using a hash function h. Note that the keys k1 and k3 map to the same slot. The
mapping of more than one key to the same slot is known as a collision. We can also
say that the keys k1 and k3 have collided.
We usually say that an element with key k hashes to slot h(k). We can also say that
h(k) is the hash value of key k.
The main flaw in this technique is that two or more unequal keys may hash to
the same slot, which leads to the condition called collision. Ideally, collisions would
be avoided by carefully choosing the hash function. In practice, however, collisions
cannot be avoided entirely, whatever the nature of the hash function. Therefore, the
best solution is to minimize the number of collisions and to devise a scheme to
resolve those that do occur.
In the hash function h(k) = ⌊m (kA mod 1)⌋, kA mod 1 means the fractional part of
kA, that is, kA - ⌊kA⌋. Note that ⌊x⌋ is read as "floor of x" and represents the
largest integer less than or equal to x.
Example:
Consider a hash table with 10000 slots i.e. m=10000, then the hash function:
h(k) = ⌊m (kA mod 1)⌋
will map the key 123456 to slot 41 since
h(123456) = ⌊10000 * (123456 * 0.61803... mod 1)⌋
= ⌊10000 * (76300.0041151... mod 1)⌋
= ⌊10000 * 0.0041151...⌋
= ⌊41.151...⌋
= 41
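The calculation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name multiplication_hash is chosen here for illustration, and the constant A is assumed to be (sqrt(5) - 1) / 2, which matches the value 0.61803... used in the example.

```python
import math

# Assumed constant: (sqrt(5) - 1) / 2 = 0.6180339887..., matching the example.
A = (math.sqrt(5) - 1) / 2


def multiplication_hash(key, m, a=A):
    """Return floor(m * (key * a mod 1)) for a table with m slots."""
    fractional_part = (key * a) % 1.0   # kA mod 1
    return math.floor(m * fractional_part)


print(multiplication_hash(123456, 10000))  # 41
```

Because the computation uses floating-point arithmetic, very large keys could lose precision in the fractional part; an integer-only formulation would avoid that.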
k:       7148        2345
k^2:     51093904    5499025
h(k):    93          99
The hash values are obtained by taking fourth and fifth digits counting from right.
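The digit extraction can be expressed with integer arithmetic. The sketch below assumes the rule just stated (fourth and fifth digits of the square, counting from the right); the function name mid_square_hash is illustrative only.

```python
def mid_square_hash(key):
    """Square the key and keep its 4th and 5th digits counted from the right."""
    square = key * key
    # Dropping the three rightmost digits leaves the 4th digit in the units
    # place; taking the result mod 100 keeps the 5th and 4th digits.
    return (square // 1000) % 100


print(mid_square_hash(7148))  # 7148^2 = 51093904 -> 93
print(mid_square_hash(2345))  # 2345^2 = 5499025  -> 99
```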
and the sum of these parts, after ignoring the last carry, will also be a three-digit
number in the range 0 to 999.
Example:
Consider a hash table with 100 slots i.e. m=100, and key values k = 9235, 714
and 71458.
The calculations are shown below:
k:              9235      714      71458
Parts:          92, 35    71, 4    71, 45, 8
Sum of parts:   127       75       124
h(k):           27        75       24
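The folding calculation can be sketched in Python as follows. The function name folding_hash and the parameter digits_per_part are chosen here for illustration; the key is split into two-digit parts from the left, as in the example, and the carry is dropped by taking the sum modulo 100.

```python
def folding_hash(key, digits_per_part=2):
    """Split the key into parts, add them, and drop the carry.

    Returns (parts, sum_of_parts, hash_value).
    """
    text = str(key)
    parts = [int(text[i:i + digits_per_part])
             for i in range(0, len(text), digits_per_part)]
    total = sum(parts)
    # Keeping only the last digits_per_part digits discards the carry.
    return parts, total, total % (10 ** digits_per_part)


for k in (9235, 714, 71458):
    print(k, folding_hash(k))
# 9235 ([92, 35], 127, 27)
# 714 ([71, 4], 75, 75)
# 71458 ([71, 45, 8], 124, 24)
```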
In the above examples, we have assumed that the keys are numeric. If the keys
are alphanumeric, the ASCII codes of a fixed number of characters can be added
together to transform the character key into an equivalent numeric key, and then any
of the above hash functions can be applied.
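As a rough illustration of this idea, the sketch below adds the ASCII codes of the first few characters of a key and then applies the division method (k mod m). The function names and the choice of three characters are assumptions made here, not part of the original notes.

```python
def string_key_to_number(key, num_chars=3):
    """Add the character codes of the first num_chars characters of the key."""
    return sum(ord(ch) for ch in key[:num_chars])


def hash_string(key, m, num_chars=3):
    """Convert the key to a number, then apply the division method k mod m."""
    return string_key_to_number(key, num_chars) % m


print(string_key_to_number("ABC"))  # 65 + 66 + 67 = 198
print(hash_string("ABC", 100))      # 198 mod 100 = 98
```

Note that a plain sum ignores character order, so keys such as "ABC" and "CBA" collide; this is one reason the number of collisions can only be reduced, not eliminated.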
M10.5.1: Chaining:
In this scheme, all the elements whose keys hash to the same hash-table slot are put
in one linked list, as shown in the following figure. Thus, slot i in the hash table
contains a pointer to the head of the linked list of all the elements that hash to
value i. If no element hashes to value i, slot i contains the NULL
value (pictured as /).
To represent the dynamic set, we use an array T[0...m-1] of pointers in which each
position or slot, say i, contains a pointer to a linked list of all the elements that
hash to value i. As all the collided keys are chained together in the form of a list, this
method is called the chaining method. The advantage of this method is that there is
no limit on the number of collisions: a collision list may grow to any size.
Let us consider the insertion of elements 5, 28, 19, 15, 20, 33, 12, 17, 10 into a
chained hash table. Let us suppose that the hash table has 9 slots and that the hash
function is h(k) = k mod 9.
To begin with, the chained hash table is initialized with NULL pointers as shown in
figure (a).
Since h(5) = 5 mod 9 = 5, create a linked list for slot T[5] and store value 5 in its
only node as shown in figure (b).
Since h(28) = 28 mod 9 = 1, create a linked list for slot T[1] and store value 28 in
its only node as shown in figure (c).
Since h(19) = 19 mod 9 = 1, insert value 19 in the beginning of the linked list for
slot T[1] as shown in figure (d).
Since h(15) = 15 mod 9 = 6, create a linked list for slot T[6] and store value 15 in
its only node as shown in figure (e).
Since h(20) = 20 mod 9 = 2, create a linked list for slot T[2] and store value 20 in
its only node as shown in figure (f).
Since h(33) = 33 mod 9 = 6, insert key 33 in the beginning of the linked list for
slot T[6] as shown in figure (g).
Since h(12) = 12 mod 9 = 3, create a linked list for slot T[3] and store key 12 in its
only node as shown in figure (h).
Since h(17) = 17 mod 9 = 8, create a linked list for slot T[8] and store key 17 in its
only node as shown in figure (i).
Since h(10) = 10 mod 9 = 1, insert key 10 in the beginning of the linked list for slot
T[1] as shown in the last figure (j).
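The insertion sequence above can be replayed with a small Python sketch of a chained hash table. The names Node, chained_insert, chained_search and chain_keys are illustrative choices, not taken from the original notes; None plays the role of the NULL pointer, and new keys are inserted at the head of the list as in the example.

```python
class Node:
    """One element of a collision list."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


def chained_insert(table, key):
    """Insert key at the head of the list for slot h(key) = key mod m."""
    slot = key % len(table)
    table[slot] = Node(key, table[slot])


def chained_search(table, key):
    """Return the node holding key, or None if the key is absent."""
    node = table[key % len(table)]
    while node is not None and node.key != key:
        node = node.next
    return node


def chain_keys(table, slot):
    """List the keys stored in one slot, from head to tail."""
    keys, node = [], table[slot]
    while node is not None:
        keys.append(node.key)
        node = node.next
    return keys


table = [None] * 9                      # 9 slots, all NULL initially
for k in (5, 28, 19, 15, 20, 33, 12, 17, 10):
    chained_insert(table, k)

print([chain_keys(table, i) for i in range(9)])
# [[], [10, 19, 28], [20], [12], [], [5], [33, 15], [], [17]]
```

Inserting at the head keeps insertion at constant time; a search only walks the one list that the key hashes to.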
a. Linear probing:
In which the interval between probes is fixed, often at 1.
b. Quadratic probing:
In which the interval between probes increases linearly (hence, the indices
are described by a quadratic function).
c. Double Hashing:
In which the interval between probes is fixed for each record but is
computed by another hash function.
Linear probing uses the hash function h(k, i) = (h(k) + i) mod m, where m is the size
of the hash table, h(k) = k mod m is the basic hash function (division method), and i
is the probe number.
Therefore, for a given key k, the first slot probed is T[h(k)]. The next slots are
T[h(k)+1], T[h(k)+2], T[h(k)+3], and so on up to slot T[m-1]. Then we wrap around to
slots T[0], T[1], T[2], and so on until we finally probe slot T[h(k)-1]. Since the initial
probe position determines the entire sequence, only m distinct probe sequences are
used with linear probing.
Example:
Consider inserting the keys 76, 26, 37, 59, 21, 65, 88 into a hash table of size m=11
using linear probing. Further consider that the primary hash function is h(k)=k mod
m.
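Initially the hash table is empty, as shown in figure (a).
Step 1: Consider first key k = 76
h(76, 0)
= (76 mod 11 + 0) mod 11
= (10 + 0) mod 11
= 10 mod 11
= 10
Since slot T[10] is free, insert key 76 at this place, as shown in figure (b).
Step 2: Consider second key k = 26
h(26, 0) = (26 mod 11 + 0) mod 11 = 4
Since slot T[4] is free, insert key 26 at this place, as shown in figure (c).
Step 3: Consider third key k = 37
h(37, 0) = (37 mod 11 + 0) mod 11 = 4
Since slot T[4] is occupied, the next probe sequence is computed as:
h(37, 1) = (37 mod 11 + 1) mod 11 = 5
Since slot T[5] is free, insert key 37 at this place, as shown in figure (d).
Step 4: Consider fourth key k = 59
h(59, 0) = (59 mod 11 + 0) mod 11 = 4
Since slot T[4] is occupied, the next probe sequence is computed as:
h(59, 1) = (59 mod 11 + 1) mod 11 = 5
Since slot T[5] is also occupied, the next probe sequence is computed as:
h(59, 2) = (59 mod 11 + 2) mod 11
= (4 + 2) mod 11
= 6 mod 11
= 6
Since slot T[6] is free, insert key 59 at this place, as shown in figure (e).
Step 5: Consider fifth key k = 21
h(21, 0) = (21 mod 11 + 0) mod 11 = 10
Since slot T[10] is occupied, the next probe sequence is computed as:
h(21, 1) = (21 mod 11 + 1) mod 11 = 11 mod 11 = 0
Since slot T[0] is free, insert key 21 at this place, as shown in figure (f).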
Step 6: Consider sixth key k = 65
h(65, 0)
= (65 mod 11 + 0) mod 11
= (10 + 0) mod 11
= 10 mod 11
= 10
Since slot T[10] is occupied, the next probe sequence is computed as:
h(65, 1)
= (65 mod 11 + 1) mod 11
= (10 + 1) mod 11
= 11 mod 11
= 0
Since slot T[0] is also occupied, the next probe sequence is computed as:
h(65, 2)
= (65 mod 11 + 2) mod 11
= (10 + 2) mod 11
= 12 mod 11
= 1
Since slot T[1] is free, insert key 65 at this place, as shown in figure (g).
Step 7: Consider seventh key k = 88
h(88, 0)
= (88 mod 11 + 0) mod 11
= (0 + 0) mod 11
= 0
Since slot T[0] is occupied, the next probe sequence is computed as:
h(88, 1)
= (88 mod 11 + 1) mod 11
= (0 + 1) mod 11
= 1
Since slot T[1] is also occupied, the next probe sequence is computed as:
h(88, 2)
= (88 mod 11 + 2) mod 11
= (0 + 2) mod 11
= 2 mod 11
= 2
Since slot T[2] is free, insert key 88 at this place, as shown in figure (h).
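The worked example can be replayed with a short Python sketch. It assumes the probe function h(k, i) = (k mod m + i) mod m used in the steps above and represents free slots as None; the function name linear_probe_insert is an illustrative choice, not taken from the original notes.

```python
def linear_probe_insert(table, key):
    """Insert key using h(k, i) = (k mod m + i) mod m; return the slot used."""
    m = len(table)
    for i in range(m):                  # probe numbers i = 0, 1, ..., m-1
        slot = (key % m + i) % m
        if table[slot] is None:         # free slot found
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")


table = [None] * 11
slots = [linear_probe_insert(table, k) for k in (76, 26, 37, 59, 21, 65, 88)]

print(slots)  # [10, 4, 5, 6, 0, 1, 2]
print(table)  # [21, 65, 88, None, 26, 37, 59, None, None, None, 76]
```

The final table shows two runs of adjacent occupied slots (0 to 2 and 4 to 6), even though only 7 of the 11 slots are in use.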
Linear probing is very easy to implement, but it suffers from a problem known as
primary clustering. Here by a cluster we mean a block of contiguous occupied slots,
and primary clustering refers to the build-up of many such blocks separated by free
slots. Once clusters are formed, a subsequent insertion is more likely to end up in
one of the clusters and thereby increase its size. This increases the number of probes
required to find a free slot, and hence worsens the performance further.
To avoid the problem of primary clustering, some remedies are suggested in the
literature. Two of them are quadratic probing and double hashing.
Quadratic probing uses a hash function of the form
h(k, i) = (h(k) + c1*i + c2*i^2) mod m, where m is the size of the hash table,
h(k) = k mod m is the basic hash function (division method), c1 and c2 != 0 are
auxiliary constants, and i is the probe number.
The initial slot probed is T[h(k)] and the other slots probed are offset by amounts that
depend in a quadratic manner on the probe number i. This method works much
better than linear probing, but to make full use of the hash table, the values of c1, c2
and m are constrained. Also, if two keys have the same initial probe slot, then
their probe sequences are the same, since h(k1, 0) = h(k2, 0) implies h(k1, i) = h(k2, i).
This leads to a milder form of clustering called secondary clustering. As in linear
probing, the initial probe determines the entire sequence, so only m distinct
sequences are used.
Example:
Consider inserting the keys 76, 26, 37, 59, 21, 65, 88 into a hash table of size m=11
using quadratic probing with c1=1 and c2=3. Further consider that the primary hash
function is h(k)=k mod m.
Solution:
We have h(k, i) = (k mod 11 + i + 3i^2) mod 11.
(Figures (a) to (e): states of the hash table during the insertions.)
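The insertions for this example can be traced with a short Python sketch. It assumes the usual quadratic probing form h(k, i) = (k mod m + c1*i + c2*i^2) mod m with the constants c1 = 1 and c2 = 3 given in the example; the function name quadratic_probe_insert is illustrative only.

```python
def quadratic_probe_insert(table, key, c1=1, c2=3):
    """Insert key using h(k, i) = (k mod m + c1*i + c2*i*i) mod m."""
    m = len(table)
    for i in range(m):                  # probe numbers i = 0, 1, ..., m-1
        slot = (key % m + c1 * i + c2 * i * i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    # With quadratic probing this can happen even if free slots remain.
    raise OverflowError("no free slot found in the probe sequence")


table = [None] * 11
slots = [quadratic_probe_insert(table, k) for k in (76, 26, 37, 59, 21, 65, 88)]

print(slots)  # [10, 4, 8, 7, 3, 2, 0]
print(table)  # [88, None, 65, 21, 26, None, None, 59, 37, None, 76]
```

For instance, key 59 first probes slot 4 (occupied by 26), then slot (4 + 1 + 3) mod 11 = 8 (occupied by 37), and finally slot (4 + 2 + 12) mod 11 = 7, which is free.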
Double hashing uses a hash function of the form h(k, i) = (h1(k) + i*h2(k)) mod m,
where m is the size of the hash table, and h1(k) = k mod m and h2(k) = k mod m' are
two auxiliary hash functions. Here m' is chosen to be slightly less than m (say, m-1 or
m-2). Therefore, for a given key k, the first slot probed is T[h1(k)] and the successive
probes are offset from the previous positions by the amount h2(k) modulo m. Thus,
unlike the case of linear or quadratic probing, the probe sequence depends in two
ways on the key value k, since the initial probe position and/or the offset may vary.
Double hashing represents an improvement over linear or quadratic probing, as
O(m^2) probe sequences are used rather than O(m): each possible pair <h1(k), h2(k)>
yields a distinct probe sequence, and as we vary the key value, the initial probe
position h1(k) and the offset h2(k) may vary independently. As a result, the
performance of double hashing appears to be very close to the performance of the
ideal scheme of uniform hashing.
Example:
Consider inserting the keys 76, 26, 37, 59, 21, 65, 88 into a hash table of size m=11
using double hashing. Further, consider that the auxiliary hash functions are h1(k) = k
mod 11 and h2(k) = k mod 9.
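Solution:
We have h(k, i) = (h1(k) + i*h2(k)) mod 11.
Step 1: Consider first key k = 76
h1(76) = 76 mod 11 = 10
h2(76) = 76 mod 9 = 4
Since slot T[10] is free, insert key 76 at this place, as shown in figure (b).
Step 2: Consider second key k = 26
h1(26) = 26 mod 11 = 4
h2(26) = 26 mod 9 = 8
Since slot T[4] is free, insert key 26 at this place, as shown in figure (c).
Step 3: Consider third key k = 37
h1(37) = 37 mod 11 = 4
h2(37) = 37 mod 9 = 1
Since slot T[4] is occupied, the next probe sequence is computed as:
h(37, 1) = (4 + 1*1) mod 11 = 5 mod 11 = 5
Since slot T[5] is free, insert key 37 at this place, as shown in figure (d).
Step 4: Consider fourth key k = 59
h1(59) = 59 mod 11 = 4
h2(59) = 59 mod 9 = 5
Since slot T[4] is occupied, the next probe sequence is computed as:
h(59, 1) = (4 + 1*5) mod 11 = 9 mod 11 = 9
Since slot T[9] is free, insert key 59 at this place, as shown in figure (e).
Step 5: Consider fifth key k = 21
h1(21) = 21 mod 11 = 10
h2(21) = 21 mod 9 = 3
Since slot T[10] is occupied, the next probe sequence is computed as:
h(21, 1) = (10 + 1*3) mod 11 = 13 mod 11 = 2
Since slot T[2] is free, insert key 21 at this place, as shown in figure (f).
Step 6: Consider sixth key k = 65
h1(65) = 65 mod 11 = 10
h2(65) = 65 mod 9 = 2
Since slot T[10] is occupied, the next probe sequence is computed as:
h(65, 1) = (10 + 1*2) mod 11 = 12 mod 11 = 1
Since slot T[1] is free, insert key 65 at this place, as shown in figure (g).
Step 7: Consider seventh key k = 88
h1(88) = 88 mod 11 = 0
h2(88) = 88 mod 9 = 7
Since slot T[0] is free, insert key 88 at this place, as shown in figure (h).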
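The double hashing example can be checked with a short Python sketch. It assumes the probe function h(k, i) = (h1(k) + i*h2(k)) mod m with h1(k) = k mod 11 and h2(k) = k mod 9 as stated in the example; the function name double_hash_insert and the parameter m2 (standing for the second modulus) are illustrative only.

```python
def double_hash_insert(table, key, m2=9):
    """Insert key using h(k, i) = (h1(k) + i*h2(k)) mod m."""
    m = len(table)
    h1 = key % m                        # initial probe position
    h2 = key % m2                       # offset between successive probes
    # Caveat: if h2 is 0 (key divisible by m2), every probe hits the same slot.
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free slot found in the probe sequence")


table = [None] * 11
slots = [double_hash_insert(table, k) for k in (76, 26, 37, 59, 21, 65, 88)]

print(slots)  # [10, 4, 5, 9, 2, 1, 0]
print(table)  # [88, 65, 21, None, 26, 37, None, None, None, 59, 76]
```

Keys 26, 37 and 59 all start at slot 4, but because their offsets differ (8, 1 and 5) they follow different probe sequences, which is what avoids the clustering seen with linear and quadratic probing.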
M10.5.3: Rehashing:
If at any stage the hash table becomes nearly full, the operations will start taking
too much time, and the insert operation may even fail for open addressing with
quadratic probing. This can happen if there are too many deletions intermixed with
too many insertions.
In such a situation, the best possible solution is as stated below:
1. Create a new hash table double the size of the original hash table.
2. Scan the original hash table, and for each key, compute the new hash value
and insert into the new hash table.
3. Free the memory occupied by the original hash table.
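The three steps above can be sketched in Python for an open-addressing table. This is only an illustration under stated assumptions: linear probing with h(k) = k mod m is used for the new table, free slots are represented as None, and the function names probe_insert and rehash are chosen here. In Python the old table's memory is reclaimed automatically once it is no longer referenced, which stands in for step 3.

```python
def probe_insert(table, key):
    """Linear probing insert with h(k, i) = (k mod m + i) mod m."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")


def rehash(old_table):
    """Build a table of double the size and re-insert every stored key."""
    new_table = [None] * (2 * len(old_table))    # step 1: new, larger table
    for key in old_table:                        # step 2: scan the old table
        if key is not None:
            probe_insert(new_table, key)         # hash value uses the new size
    return new_table                             # step 3: old table is dropped


old_table = [21, 65, 88, None, 26, 37, 59, None, None, None, 76]
new_table = rehash(old_table)

print(len(new_table))   # 22
print(new_table[4], new_table[10], new_table[15])  # 26 76 37
```

Every key must be re-inserted rather than copied to the same index, because the hash value k mod m changes when m changes.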