0% found this document useful (0 votes)

164 views

Hashing

Hashing is a technique for inserting or retrieving records from a table in constant time by using a hash function to map a key to a specific position in the table. Collisions occur when the hash function produces the same result for different keys. There are four main solutions to resolve collisions: (1) open addressing, which involves probing to the next empty slot if a collision occurs; (2) separate chaining, which uses linked lists at each table position to handle collisions; (3) linear probing, which sequentially searches the next table positions if a collision occurs; and (4) quadratic probing, which searches positions that are farther apart if a collision occurs. The size of the hash table and choice of hash function are important

Uploaded by

vicrattlehead2013

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

164 views

Hashing

Uploaded by

vicrattlehead2013

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 24

HASHING

INTRODUCTION

Hashing is the term used for inserting or retrieving a record into a table (or file) in
constant time.

In this Chapter some commonly used terms are defined. Hashing is explained in greater
detail. Firstly a trivial example is done to set the scene. This example is then extended to
real situations. It is possible for the Hash function to produce the same result for different
keys. This phenomenon is known as a collision. Four different solutions for resolving
collisions are discussed and compared.

Using keys comprising characters is discussed. Also computational issues that arise when
large keys are used are discussed and solutions given

Finally some commonly used hash functions are described and discussed.

TERMS USED

KEY: The key is some data associated with the record that UNIQUELY identifies the
specific record. ( ie The record may consist of: Name, Address, ID number. The ID
number is UNIQUE (no 2 people have the same ID) thus it can be used as the key of the
record).

X DIV Y: The DIV function returns the integer value equal to the number of times Y
divides into X. Any remainder is disreguarded. (X and Y are integers).

Eg 12 DIV 3 = 4 (Exact)
13 DIV 3 = 4 (Remainder 1, which is disreguarded)
14 DIV 3 = 4 (Remainder 2, which is disreguarded)
15 DIV 3 = 5

X MOD Y: The MOD function returns the remainder after division of X by Y. (X and
Y are integers).

Eg 27 Mod 10 = 7 (2 * 10 + 7 = 27)
17 Mod 10 = 7
36 Mod 10 = 6 (3 * 10 + 6 = 26)
5 Mod 10 = 5 (0 * 10 + 5 = 5)

10 Mod 10 = 0 (1 *10 + 0 = 10) (No remainder)

1
30 Mod 10 = 0 30 Mod 11 = 8
30 Mod 13 = 4 26 Mod 13 = 0 131 Mod 13 = 1

From these two functions:

X = (X DIV Y) * Y + (X MOD Y)

HASH FUNCTION h: h(key) = POSITION in which to place the record in the table.
This is a function, that uses the key of the record, to calculate the position in which to
place the record in the table.

COLLISION: The hash function may produce the same result for several different keys.
Collisions are then said to occur. These will be resolved a little later on.

TRIVIAL EXAMPLE

Problem: Assume that we have records about exactly 10 people.

Each record is comprised of: ID Number, Name and Telephone Number.

The ID numbers of the 10 people are respectively 0,1,2,3,4,5,6,7,8,9.

The ideal data structure is an array of records called TABLE as shown below:

ID number Name Tel no

0 Mike 2341
1 Pam 2378
2 John 1289
3 Sally 3478
4 Tom 3478
5 Peter 2378
6 Colin 7691
7 Janet 1479
8 Mary 8643
9 Pam 9876

The Hash function h(key) is simply the trivial function:

h(key) = key = Row number in the array

In other words if the key is 3 then you will find the desired record in the 4th row of the
array (remember that arrays start at row 0).

The advantages of this method are:

2
1) we can insert, find or delete a record in CONSTANT time (it is just 1 operation);
and
2) the table is exactly the correct size (neither too big nor too small).

The disadvantage of this method is:

1) if we wish to add an 11th person it’s a hassle. We would have to define a new
array of size 11. We would have to re-hash all the records from our old table to
the new one. Finally we could add the 11th record to the new table.

In this example it is not necessary to store the ID number because it is the same as the
row index of the table. Later on, in more complicated cases where there can be clashes, it
is essential to keep the key because we need to check that we have found the correct
record.

NOT SO TRIVIAL EXAMPLE

What if the ID numbers of the people were: 0, 1, 2, 3, 4, 5, 6, 7, 8, 999

There are still only 10 students but their ID numbers range from 0 to 999.

We choose the same method of storage giving;

ID number Name Tel no

0 Mike 2341
1 Pam 2378
2 John 1289
3 Sally 3478
4 Tom 3478
5 Peter 2378
6 Colin 7691
7 Janet 1479
8 Mary 8643
… … …
… … …
… … …
999 Pam 9876

The advantage of this method is:

1) we can insert, find or delete a record in CONSTANT time.

The disadvantage of this method is that the:

3
1) Table size is much greater than the number of entries in the table. The Table size
is 1000 and the number of entries is only 10. This means 99% of the table is
wasted space which is very poor space utilization

The challenge is to save space. This is achieved by using a more sophisticated Hash
function. More about this shortly.

In real life there are many problems that show the behaviour above where the number of
entries in the table is far smaller than the maximum entry key. Often however the entries
are more randomly distributed over the table. For example the 10 entries may have the
ID’s of: 2, 45, 67, 123, 345, 346, 459, 554, 721, 999 .

A SIMPLE HASH FUNCTION

As before we wish to store the following 4 records in a table:

ID Name Tel No
key
81 John 12345
64 Sally 23456
39 Mary 34567
46 Philimon 87654

Now we choose a more complex Hash function: h = (key) MOD 11

To insert the 1st record: 81 John 12345

Hash h = (81) Mod 11 = 4 (11 divides into 81 seven times (77) with a remainder of 4)

So we insert the record in position (or row) 4 of the table

ID Name Tel No
0
1
2
3
4 81 John 12345
5
6
7
8
9
10
Position Key

4
To insert the next record: 64 Sally 23456

h = (64) Mod 11 = 9 Thus this record is inserted in position 9.

ID Name Tel No
0
1
2
3
4 81 John 12345
5
6
7
8
9 64 Sally 23456
10
Position Key

To insert the next record: 39 Mary 34567

h = (39) Mod 11 = 6 This record is inserted in position 6 of the table

ID Name Tel No
0
1
2
3
4 81 John 12345
5
6 39 Mary 34567
7
8
9 64 Sally 23456
10
Position Key

5
To insert the next record: 46 Philimon 87654

h = (46) Mod 11 = 2

ID Name Tel No
0
1
2 46 Philimon 87654
3
4 81 John 12345
5
6 39 Mary 34567
7
8
9 64 Sally 23456
10
Position Key

Let us consider what has happened here:

1) The Hash function for each entry was calculated and the record inserted in the
table in that position;
2) This example specifically contained no collisions. In other words each entry
hashed to a unique position. This is artificial and I will show next what to do
when collisions occur;
3) I chose a table size at least double the number of records to be entered. This is
to ensure that the method works relatively efficiently;
4) The table size was specifically chosen as a prime number. In later examples
the method is only guaranteed to work when the table size is a prime.

* IMPORTANT NOTE: From now on I will omit the Name & Tel No information from
the table. In practice it is there. This is just to save time and space in future examples

ID
0
1
2 46
3
4 81
5
6 39
7
8
9 64
10
Position Key

6
Collisions

Life is not perfect with Hashing. The Hashing function can produce the same position to
place two or more records. Clearly this is unacceptable. The simplest solution when this
happens is just to try the next position in the table & so on until an empty slot is found.

Lets continue the example:

Insert: 20 Joe 1342

h = (20) Mod 11 = 9

We attempt to insert this record into position 9. We cant because it is full.

Add 1 to the position i.e. position 10 & try there. Fine it is empty so insert the record
there (Note I am omitting the Name & Tel No from the description).

ID
0
1
2 46
3
4 81
5
6 39
7
8
9 64
10 20
Position Key

Now insert the record: 31 Zoe 1328

h = (31) Mod 11 = 9

We attempt to insert this record into position 9. We cant because it is full.

Add 1 to the position i.e. position 10 & try there. No good it is full.
Add 1 to the position i.e. position 11. Well position 11 does not exist. It is off the end of
the table. So try position 0 at the start of the table instead. It is empty so insert the record.
there. (If it was full you would try position 1 & so on until you find an empty position).

The mathematical description of going beyond the end of the table & wrapping around to
the start to the table can be simply written as Mod (Table-size) or Mod 11 in this case.

7
The final table is:

ID
0 31
1
2 46
3
4 81
5
6 39
7
8
9 64
10 20
Position Key

The advantages of this method are:

1) The method is simple to implement; and
2) As long as there is an empty slot in the table a new record can be inserted.

The disadvantages of the method are:

1) It can be inefficient. It often occurs that a large blocks of fully occupied cells
form inside the table. This is called Primary Clustering. When a new record
is hashed to some value in this full cluster it takes many attempts before an
empty slot can be found. This is illustrated in the next example.
2) If the table is 50% full then it can be shown that an empty slot can be found
after 2.5 attempts. If the table is 90% full then some 50 attempts are required
to find an empty slot.
3) To keep the method efficient we must keep the table at most half full. Once
this happens a new table must be set up. Its size needs to be the 1st prime
number that is at least double the old table size. Then we must re-hash all the
records in the old table into the new table (This is a lot of work). Then we can
carry on.
4) Deletion problem. This occurs when a record is deleted from the table and this
record is on the path to some other record that was subsequently inserted. The
search hits an empty slot, stops and reports that the record is not there. In fact
it is but the method can not find it. The solution to this problem is to mark
each position as EMPTY, FULL or DELETED. When we a searching for a
record if the position is marked as DELETED the search continues. The
search only stops when the record is found or a EMPTY slot is found. This is
illustrated in the next example.

8
HASHING with OPEN ADDRESSING -- LINEAR PROBING

What I have described is Hashing with open addressing with collisions resolved by
Linear probing.

This can be written in a general form as: Hash = (h(x) + f(i) i=0,1,2…) Mod (Table-size)

In the next example: h(x) = key Mod 11, the Table-size = 11 and f(i) = i, giving

Hash = ( key Mod 11+ i i=0,1,2…) Mod 11

For illustration purposes 5 records are hashed into the table. These records have the keys:
51, 62, 74, 19 ,73

Key Hash
51 7 OK
62 7 => (7 + 1) = 8 OK
74 8 => (8 + 1) = 9 OK
19 8 => 9 =>10 OK
73 7 => 8 => 9 => 10 =>11 = 0 OK

Empty After 51 After 62 After 74 After 19 After 73

0 73
1
2
3
4
5
6
7 51 51 51 51 51
8 62 62 62 62
9 74 74 74
10 19 19

Now lets FIND record 73. We use exactly the same procedure as inserting the record.
Key Hash
73 7 Is the key in position 7 = 73 (51<>73) No Try next position
8 Is the key in position 8 = 73 (62<>73) No Try next position
9 Is the key in position 9 = 73 (74<>73) No Try next position
10 Is the key in position 10 = 73 (19<>73) No Try next position
0 Is the key in position 0 = 73 (67<>73) Yes Found
Right this example WORKS.

9
Now : DELETE record 74. The resulting table is:

Empty After 51 After 62 After 74 After 19 After 73 After

74
Deleted
0 73 73
1
2
3
4
5
6
7 51 51 51 51 51 51
8 62 62 62 62 62
9
10 19 19 19

Lets try to FIND record 73. We use exactly the same procedure as inserting the record.

Key Hash
73 7 Is the key in position 7 = 73 (51<>73) No Try next position
8 Is the key in position 8 = 73 (62<>73) No Try next position
9 Record EMPTY thus stop search => record NOT found. This is incorrect.

WE needed to mark this record as DELETED so the search can continue on to position
10 then 0 where the record is found.

HASHING with OPEN ADDRESSING -- QUADRATIC PROBING

The next strategy is to use f(i) = i2

We try 12 =1 position away from the original hash, then 22 = 4 positions away, then 32 =
9 positions away & so on.

Hash = (h(x) + f(i) i=0,1,2…) Mod (Table-size)

In the next example: h(x) = key Mod 11, the Table-size = 11 and f(i) = i2
Which gives;

Hash = ( key Mod 11 + i2 i=0,1,2…) Mod 11

10
For illustration purposes 6 records are hashed into the table. These records have the keys:
51, 62, 74, 19, 73

Key Hash
51 7 OK
62 7 => (7 + 12) = 8 OK
74 8 => (8 + 12) = 9 OK
19 8 => (8 + 1 ) = 9 => (8 + 22) = 12 Mod(11) = 1
2
OK
73 7 => (7 + 12) = 8 => (7 + 22) = 11 Mod(11) = 0 OK
2 2 2
84 7 => (7 + 1 ) = 8 => (7 + 2 ) = 11 Mod(11) = 0 => (7 + 3 ) = 16 Mod(11) =5 OK
95 7 => (7 + 12) = 8 => (7 + 22) = 11 Mod(11) = 0 => (7 + 32) = 16 Mod(11) =5
=> (7 + 42) = 23 Mod(11) = 1 => (7 + 52) = 32 Mod(11) = 10 OK
106 7 Now try it yourself – you will discover that while there are still empty
slots in the table the hash function wont find them – POTENTIAL DISASTER! Don’t
worry there is a solution.

Empty After After After After After After After After

51 62 74 19 73 77 84 95
0 73 73 73 73
1 19 19 19 19 19
2
3
4
5 84 84
6
7 51 51 15 51 51 51 51 51
8 62 62 62 62 62 62 62
9 74 74 74 74 74 74
10 95

Advantages of the method:

1) Primary clustering is reduced & effectively eliminated.

Disadvantages of the method:

1) the calculation to do the hash is a bit more complex;
2) the table can still have empty slots but the hash function & collision resolution
does not find them. This problem can be solved in a simple fashion. The table size
must be a prime number and the table MUST be kept less than half full. When
these two conditions are met it can be proved that a record will be placed in the
table. Once this happens a new table must be set up. Its size needs to be the 1st
prime number that is at least double the old table size. Then we must re-hash all
the records in the old table into the new table (This is a lot of work). Then we can
carry on inserting further values.

11
3) While quadratic probing is better than linear probing it still has a disadvantage.
Every record that hashes to the same position ( 62, 73, 84 ,95 ) all follow exactly
the same path away from the original position occupied by 51. As more records t
hash to this position the longer the collision resolution process takes to find an
empty slot. This is known as secondary clustering

HASHING with OPEN ADDRESSING -- DOUBLE HASHING

The next strategy is to use f(i) = i * h2(x)

When we have a collision we use a second hash function to determine the distance away
from the first position to try to placethe record, if that fails we move that distance again &
so on i.e.
We try at:
h1(x)
h1(x) + h2(x)
h1(x) + 2 * h2(x)
h1(x) + 3 * h2(x); and so on

Hash = (h(x) + f(i) i=0,1,2…) Mod (Table-size) becomes

Hash = ( h1(x) + i * h2(x) i=0,1,2…) Mod (Table-size)

To ensure that this method words the following conditions must hold:
1) h2(x) must NOT be ZERO ( If it was ever zero then not other positions would ever
be tried)
2) All cells must be capable of being tried. The following formula for h2(x) is a
good one:
h2(x) = R – (x mod R) where R is a prime number less than the Table-size

In the next example: h1(x) = key Mod 11, the Table-size = 11 and
h2(x) = 7 – (key mod 7)

The following 6 records with keys: 38 , 1, 16 ,49, 11, 60 will be hashed to the table.

Key h1(x) h2(x) = 7 – (key mod 7) h1(x) + i * h2(x) i=1,2,3..

38 38 Mod 11 = 5 OK
1 1 Mod 11 = 1 OK
16 16 Mod 11 = 5 full 7- 16 Mod 7=7–2=5 Hash=5+5=10 OK
49 49 Mod 11 = 5 full 7- 49 Mod 7=7–0=7 Hash=5+7=12 Mod 11 = 1 full
Hash=5+14=19 Mod 11 =8 OK
11 11 Mod 11 = 0 OK
60 60 Mod 11 = 5 full 7- 60 Mod 7=7-4=3 Hash=5+3=8 full
Hash=5+2*3=5+6=11Mod 11 =0 full
Hash=5+3*3=5+9=14Mod 1 =3 OK

12
Empty After After After After After After
38 1 16 49 11 60
0 11 11
1 1 1 1 1 1
2
3 60
4
5 38 38 38 38 38 38
6
7
8 49 49 49
9
10 16 16 16 16

Advantages of the method:

1) Records with ID 16, 49, 60 all hashed to position 5 which clashed with where
record 38 had already been placed.
In the case of 16 a position 5 away was tried & was successful .
In the case of 49 a position 7 away then 14 away was tried
In the case of 60 a position 3, 6,then 9 away was tried until successful.

In quadratic hashing all clashes to a position follow exactly the same path away
from this position. In double hashing DIFFERENT paths away from the clash
position are followed. This reduces the length of path that has to be traversed
before an empty slot is found. Naturally not every clash follows a different path.
Some will follow the same path as others.

Disadvantages of the method:

1) the calculation to do the second hash is a bit more complex.

13
REHASHING

Rehashing is the process that needs to be done when the table of entries gets too full.
Normally one rehashes once the table becomes 50% full. The reason for this is that a)
when linear probing is used the number of attempts required to find an empty slot
increases dramatically as the table fills; and b) with quadratic probing the method is only
guaranteed to work is the table is less than 50% filled.

The process of re-hashing is simple but time consuming.

1) A new table is created that is at least double the size of the original table. The usual
choice for the new table size is the first prime number that is at least twice as big as the
old table size. In the example below the old table size was 5. The new table size is 11.
11 is the first prime that is at least double the old table size of 5

2) Go through all the entries in the old table and rehash each one to the new table. In this
example the old hash was h1 = key Mod 5 for the table of size 5. For the new table of size
11 the new hash would be h2 = key Mod 11.

Example:

Old hash: h1 = key Mod 5

Insert 24, 14, 34

key h1 = key Mod 5
24 24 Mod 5 = 4
14 14 Mod 5 = 4 full, try 5 Mod 5 = 0 OK
34 34 Mod 5 = 4 full, try 5 Mod 5 = 0 full, try 0 + 1 = 1 OK
But the table is now more that 50% full with 3 entries in a 5 slot table.

Initially After 24 After 14 After 34

inserted inserted inserted
0 14 14
1 34
2
3
4 24 24 24

Create a new 11 slot table.

Start at the top of the old table and re-hash each entry using h2 = key Mod 11

key h2 = key Mod 11

14 14 Mod 11 = 3 OK
34 34 Mod 11 = 1 OK
24 24 Mod 11 = 2 OK
As you can see the records hash to different positions in the new 11 slot table

14
New 11 slot table after rehashing completed

Initially After 14 After 34 After 24

inserted inserted inserted
0
1 34 34
2 24
3 14 14 14
4
5
6
7
8
9
10

SEPARATE CHAINING

A final method of resolving collisions is to use a table with one entry for every possible
hash value. When a collision occurs a linked list is created with the new record being
inserted at the end of that particular list.

Hash h = Key Mod 11 for a table-size of 11.

Example:
Hash h = Key Mod 11

Key Hash
22 Joe 0
26 Pam 4
39 Zulu 6
48 Henk 4
17 Bob 6

15
The disadvantage of the method is:
a. Space has to be allocated for the pointer within each record
b. It can take considerable time to find the required record.

The advantages of the method are:

c. The Table-size is constant. The lists from each entry grow as entries are
inserted into the table.
d. The number of entries in the table usually exceed the table size. This is
because several values collide and are put onto a list.

ANALYSIS OF SEPARATE CHAINING

Assume a Table size of M

Assume there are N entries.
Load factor x = N/M (=1 typically)

Average number of probes for a SUCCESSFUL search = 1 + x/2

Average number of probes for an UNSUCCESSFUL search = 1 + x

16
Example1:
See example above:
Table size = M = 10
No. entries = N = 5
Load factor = x = N/M = 5/10 = .5 (Average length of each linked list)

Average number of probes for a SUCCESSFUL search = 1 + x/2 = 1 + .5/2 = 1.25

Average number of probes for an UNSUCCESSFUL search = 1 + x = 1 + .5 = 1.5

Example 2:
Table size = M = 10
No. entries = N = 20
Load factor = x = N/M = 20/10 = 2 (Average length of each linked list)

Average number of probes for a SUCCESSFUL search = 1 + x/2 = 1 + 2/2 = 2

Average number of probes for an UNSUCCESSFUL search = 1 + x = 1 + 2 = 3

EVALUATION OF HASH FUNCTIONS

There are 3 issues:

1) Evaluation of characters;
2) Speed of evaluation; and
3) The Overflow problem.

Evaluation of Characters

Up till now a simple hash function has been used where the key has always been a
numeric value. Often we wish to use a name, or part of a name, as the key i.e. Mike.

The Hash is h = (Mike) mod Table-size

We evaluate the string “Mike” by replacing each character by its numeric value.

In the ASCII character set there are 128 characters. 52 letters, special characters +,- etc
and control characters like NULL. The numeric values of the characters are:

A = 65 J = 74 S = 83 a = 97 j = 106 s = 115
B = 66 K = 75 T = 84 b = 98 k = 107 t = 116
C = 67 L = 76 U = 85 c = 99 l = 108 u = 117
D = 68 M = 77 V = 86 d = 100 m = 109 v = 118
E = 69 N = 78 W = 87 e = 101 n = 110 w = 119
F = 70 O = 79 X = 88 f = 102 o = 111 x = 120
G = 71 P = 80 Y = 89 g = 103 p = 112 y = 121
H = 72 Q = 81 Z = 90 h = 104 q = 113 z = 122
I = 73 R = 82 i = 105 r = 114

17
Thus “ Mike” = M * 1283 + i * 1282 + k * 1281 + e * 1280
= 77 * 1283 + 105 * 1282 + 107 * 1281 + 101 * 1280
= 77 * 128 * 128 * 128 + 105 * 128 * 128 + 107 * 128 + 101
= 77 * 2097152 + 1720320 + 13696 + 101
= 161480704 + 1720320 + 13696 + 101
= 163 214 821
Now this is quite a big number. Will it fit into a computer word?
If the computer word is 32 bits long the maximum integer it can hold is:
232 –1 = 4 294 960 000 (approx)

So our hash value will fit in (But if we added a 5th character it would not).

If the computer word is 16 bits long the maximum integer it can hold is:
216 –1 = 65536 –1 = 65535

This is far to short. Most of the contribution from the “M” and the “i” will be lost in
overflow. If the key has more characters in it the situation gets worse. Effectively the
leading characters play only a very small part in the total because they are lost in
overflow.

Speed of Evaluation

As you can see there are quite a number of additions and multiplications to be done in
this calculation. There is a simple way to minimize these.

H = k2 * 1282 + k1 * 1281 + k0
Can be written as
H = k2 * 128 * 128 + k1 * 128 + k0 which requires 3 multiplications and 2 additions.

It can also be written as:

H = ( (k2 * 128) + k1 ) *128 + k0 which only requires 2 multiplications and 2 additions

This formulation is better.

If we had 4 constants then
H = k3 * 128 * 128 * 128 + k2 * 128 * 128 + k1 * 128 + k0
which requires 6 mltiplications and 3 additions.

H =( ( (k 3 * 128) + k2 ) 128 + k1 ) 128) + k0

which only requires 4 multiplications and 3 additions.

This formulation becomes more efficient as the number of characters in the hash increase.
It is an application of what is known as Horner’s rule

Clearly this is the efficient way of performing the computation.

18
Overflow Problem

In the previous section the example of using “Mike” as a key was illustrated. The value
obtained for the hash was very big. It was however small enough to fit into a 32 bit word.
If a key of “Michael” had been used then there would have been a large amount of
overflow. Such a key should however spread the records more evenly over the entire
table.

There are several ways to reduce (or eliminate) overflow. Some of these will now be
described.

Method 1:
Just use UPPER case letters (26 + blank) and the 10 digits. This gives 37 characters in
total rather than 128. Consequently the total of the hash will be smaller and the problem
will be diminished.

H =( ( (k 3 * 37) + k2 ) * 37 + k1 ) * 37) + k0 for 4 character keys.

Method 2:

Just reduce the multiplicative factor. In this instance 37 is reduced to 32 and the Hash
becomes:

H =( ( (k 3 * 32) + k2 ) * 32 + k1 ) * 32) + k0 for 4 character keys.

Clearly the result will be smaller & more likely to fit in. Additionally multiplication by
32 can be done by shifting rather than by multiplication. This is much faster.

The above formulation will work because exactly the same calculation is done to insert a
record as is done to find a record. Using 32 rather than 37 as the multiplicative factor will
certainly compute slightly different hash positions for the same key. Using the 37 will
probably distribute the keys better. Using the 32 will still work, will be faster and will
retain a larger contribution from the characters on the left hand side of the key.

SOME PRACTICAL HASH FUNCTIONS

Assume that a name is to be used as the key.

“Michael Brown” will be used as the example

HASH 1:

Use the first 5 characters of the name: Micha

19
This choice will work. If however there are many names that are the same it will lead to
many clashes

HASH 2:

Use the first 5 characters, excluding vowels, of the name: MchlB

This choice will be better as consonants occur less frequently than vowels so the final
characters chosen should be more random.

HASH 3:

Use the first 3 characters, excluding vowels, of the first name together with the first 2
characters, excluding vowels, of the surname: MchBr

This choice should be even better because of the randomness of the consonants as well as
the randomness of first names and surnames.

HASH 4:

This hash is used by our specific university:

Use the first 3 characters, excluding vowels, of the surname together with the first 3
characters of the first name. If there are not enough consonants or characters use an “x”
BrwMic

In general it is up the designer of the hash function to find a function that spreads the data
to be used as evenly as possible over the table with as few collisions as possible. The
hash functions given above are a few examples. Each person can have fun designing their
own hash function. A word of warning however when you invent your own. Run some
test with your special function & see how many collisions you get for some subset of
your data. Then do the same for one of the above hash functions & see which is best.

QUESTIONS:

1 Explain what is ment by “hashing”.

2 What is ment by a collision?
3 What is primary clustering & why does it occur?
4 What is secondary clustering & why does it occur?
5 What are the advantages & disadvantages of using a long key?
6 What is the advantage of using double hashing as a collision resolution
method?

ANSWERS

20
1 HASHING is a method that uses a function and the key of the record to find a
position to place the record in a table { h(key) = POSITION }. The record must
be inserted, found or removed in CONSTANT time. Generally the range of the
keys is much greater than the size of the table.

2 COLLISION The hash function may produce the same result for several
different keys. A collisions is then said to occur.

3 PRIMARY CLUSTERING can occur when linear probing is used to resolve

collisions. This means that the next available position after the original one is
chosen to place the record. If many records collide they are d as close as possible
to the original position. This causes many cells close together to all be filled.

4 SECONDARY CLUSTERING occurs when quadratic probing is used to

resolve collisions. Exactly the same path away from the original position is used
(1, 4, 9, 16, 25, 36, etc) until an empty slot is found. This exact path is tried until
an empty slot is found. The more collisions there are the longer this path becomes.

5 LONG KEYS
ADVANTAGES : Give a good distribution of keys over the entire table. This
means that the records are evenly distributed over the entire table.
DISADVANTAGES: It takes longer to calculate the hash function. Overflow
problems can occur.

6 DOUBLE HASHING. In quadratic hashing all clashes to a position follow

exactly the same path away from this position. In double hashing DIFFERENT
paths away from the clash position are followed. This reduces the length of path
that has to be traversed before an empty slot is found. Naturally not every clash
follows a different path. Some will follow the same path as others.

PROBLEMS

1) Insert the following records into a table of size 17 using the hash function
h = key mod 17. Linear probing is to be used to resolve collisions. The keys of the
records are: 2, 5, 14, 31, 15, 32, 48, 49

2) Insert the following records into a table of size 13 using the hash function
h = key mod 13. Quadratic probing is to be used to resolve collisions. The keys of the
records are: 1, 2, 5, 14, 10, 23

3) Insert the following records into a table of size 13 using the hash function

21
h = key mod 13. Double hashing is to be used to resolve collisions. The second hash
function is : h2 = 1 + (key Mod 5).
The keys of the records are: 1, 24, 8, 10, 12, 21, 47, 34

SOLUTIONS

Problem1: h= key Mod 17

Key h = key mod 17

2 2 Mod 17 = 2 OK
5 5 Mod 17 = 5 OK
14 14 Mod 17 = 14 OK
31 31 Mod 17 = 14 =>15 OK
15 15 Mod 17 = 15=>16 OK
32 32 Mod 17 = 16=>17 = 0 OK
48 48 Mod 17 = 14=>15=>16=>17 = 0=>1 OK
49 49 Mod 17 = 15=>16=>17 = 0=>1=>2 =>3 OK

Problem2:

1, 2, 5,14, 10, 23

Key h = key Mod 13

1 1 Mod 13 = 1 OK
2 2 Mod 13 = 2 OK
5 5 Mod 13 = 5 OK
14 14 Mod 13 = 1 Full, 1 + 12 = 2 Full, 1 + 22=5 Full, 1 + 32=10 OK
10 10 Mod 13 = 10 Full, 10 + 12 = 11 OK
23 23 Mod 13 = 10 Full, 10 + 12 = 11 Full, 10 + 22=14 Mod 13 = 1 Full,
10 + 32=19 Mod 13 = 6 OK

Problem3:

h = key mod 13.

h2 = 1+(key Mod 5).
The keys of the records are: 1, 24, 8, 10, 12, 21, 47, 34

Key key Mod 13 1 + (key Mod 5).

1 1 Mod 13 = 1 OK
24 24 Mod 13 = 11 OK

22
8 8 Mod 13 = 8 OK
10 10 Mod 13 = 10 OK
12 12 Mod 13 = 12 OK
1 1 Mod 13 = 1 OK
21 21 Mod 13 = 8 Full 1 + 21 Mod 5 = 1 + 1 = 2
8 + 2 = 10 Full
8 + 2 * 2 = 12 Full
8 + 3 * 2 = 14 Mod 13 = 1 Full
8 + 4 * 2 = 16 Mod 13 = 3 OK
47 47 Mod 13 = 8 Full 1 + 47 Mod 5 = 1 + 2 = 3
8 + 3 = 11 Full
8 + 2 * 3 = 14 Mod 13 = 1 Full
8 + 3 * 3 = 17 Mod 13 = 4 OK
34 34 Mod 13 = 8 Full 1 + 34 Mod 5 = 1 + 4 = 5
8 + 5 = 13 Mod 13 = 0 OK

23
This document was created with Win2PDF available at http://www.daneprairie.com.
The unregistered version of Win2PDF is for evaluation or non-commercial use only.

Apache Cassandra Administrator Associate - Exam Practice Tests
From Everand
Apache Cassandra Administrator Associate - Exam Practice Tests
Cristian Scutaru
No ratings yet
Hash (4) Cuckoo Hashing
No ratings yet
Hash (4) Cuckoo Hashing
15 pages
Comp 272 Notes
0% (1)
Comp 272 Notes
26 pages
Data Structures and Algorithms: Lecture Notes For
No ratings yet
Data Structures and Algorithms: Lecture Notes For
126 pages
3 - Hashing Functions
No ratings yet
3 - Hashing Functions
20 pages
Hashing PPT For Student
No ratings yet
Hashing PPT For Student
53 pages
Hashing in Data Structures
No ratings yet
Hashing in Data Structures
27 pages
Coding Problems
0% (1)
Coding Problems
75 pages
Data Structure - UNIT-1
No ratings yet
Data Structure - UNIT-1
28 pages
Hashing and Indexing
No ratings yet
Hashing and Indexing
28 pages
Analysis of Algorithms CS 477/677: Hashing Instructor: George Bebis
No ratings yet
Analysis of Algorithms CS 477/677: Hashing Instructor: George Bebis
53 pages
AA Trees PDF
No ratings yet
AA Trees PDF
55 pages
Design and Analysis of Algorithms Lab
No ratings yet
Design and Analysis of Algorithms Lab
62 pages
Unit 3 Storage Strategies Indices B-Trees Hashing
No ratings yet
Unit 3 Storage Strategies Indices B-Trees Hashing
12 pages
CS8391 Data Structures
No ratings yet
CS8391 Data Structures
1 page
File Allocation Methods
No ratings yet
File Allocation Methods
9 pages
Avl Trees
100% (1)
Avl Trees
25 pages
Data Structure Management
No ratings yet
Data Structure Management
3 pages
Department of IT, Panimalar Engineering College, Chennai
No ratings yet
Department of IT, Panimalar Engineering College, Chennai
32 pages
Asymptotic Time Complexity
No ratings yet
Asymptotic Time Complexity
2 pages
UNIT1 Data Structures
No ratings yet
UNIT1 Data Structures
15 pages
CB19241-CSBS-Data Structure and Algorithms-Theory-Syllabus
No ratings yet
CB19241-CSBS-Data Structure and Algorithms-Theory-Syllabus
4 pages
Unit2 Skiplist
No ratings yet
Unit2 Skiplist
10 pages
M.Tech JNTUK ADS UNIT-3
No ratings yet
M.Tech JNTUK ADS UNIT-3
13 pages
FDS Unit 5
No ratings yet
FDS Unit 5
22 pages
Question Bank PDF
No ratings yet
Question Bank PDF
10 pages
DS in 7 Hours
No ratings yet
DS in 7 Hours
346 pages
Database Management System PDF
No ratings yet
Database Management System PDF
228 pages
CSPC2002
No ratings yet
CSPC2002
2 pages
Cd3291 Dsa Unit 5 Notes Eduengg
No ratings yet
Cd3291 Dsa Unit 5 Notes Eduengg
23 pages
Algorithm Analysis Design Lecture1 PowerPoint Presentation
No ratings yet
Algorithm Analysis Design Lecture1 PowerPoint Presentation
9 pages
AVL Tree
No ratings yet
AVL Tree
34 pages
DSAD Dynamic Hashing
No ratings yet
DSAD Dynamic Hashing
79 pages
Ece FDS Notes - Unit I
No ratings yet
Ece FDS Notes - Unit I
20 pages
C Operator Precedence and Associativity
100% (2)
C Operator Precedence and Associativity
2 pages
Data Structure & Algorithms - Tower of Hanoi
100% (1)
Data Structure & Algorithms - Tower of Hanoi
3 pages
Data Structures Lab Manual
100% (1)
Data Structures Lab Manual
54 pages
Jntuk Ads Lab Manual
50% (2)
Jntuk Ads Lab Manual
27 pages
Data Structure
No ratings yet
Data Structure
13 pages
Divide and Conquer Strategy
No ratings yet
Divide and Conquer Strategy
33 pages
CAT Questions
No ratings yet
CAT Questions
2 pages
Unit 2 - Data Structure - WWW - Rgpvnotes.in
No ratings yet
Unit 2 - Data Structure - WWW - Rgpvnotes.in
22 pages
Simplified DES
No ratings yet
Simplified DES
13 pages
DSA_BCS304
No ratings yet
DSA_BCS304
29 pages
Data Structures Question Bank
No ratings yet
Data Structures Question Bank
4 pages
S J C Institute of Technology Dept. of Computer Science & Engineering SUB: Data Structures & Applications (18CS32) Class: Tutorial 3
No ratings yet
S J C Institute of Technology Dept. of Computer Science & Engineering SUB: Data Structures & Applications (18CS32) Class: Tutorial 3
2 pages
Chap 12. Extendible Hashing: File Structures
No ratings yet
Chap 12. Extendible Hashing: File Structures
40 pages
CS8591 Computer Networks L T P C 3 0 0 3 Objectives
0% (1)
CS8591 Computer Networks L T P C 3 0 0 3 Objectives
5 pages
Whole Dsa Notes Made by Me
100% (1)
Whole Dsa Notes Made by Me
411 pages
OOP - I GTU Study Material Presentations Unit-1 07022022102854PM
No ratings yet
OOP - I GTU Study Material Presentations Unit-1 07022022102854PM
59 pages
Binary Trees
No ratings yet
Binary Trees
54 pages
Lecture 1
No ratings yet
Lecture 1
26 pages
Hashing
50% (2)
Hashing
43 pages
Selection Sort
No ratings yet
Selection Sort
20 pages
Data Structures Part - A (Shortanswer Questions) : Vemu Institute of Technology
No ratings yet
Data Structures Part - A (Shortanswer Questions) : Vemu Institute of Technology
6 pages
Hash Data Structure
No ratings yet
Hash Data Structure
18 pages
KMP String Matching Algorithm
No ratings yet
KMP String Matching Algorithm
8 pages
C Notes
No ratings yet
C Notes
158 pages
Module 3 DAA
No ratings yet
Module 3 DAA
16 pages
DSPD Cse-4 Course File 09-10
No ratings yet
DSPD Cse-4 Course File 09-10
34 pages
Advanced Unix Programming
From Everand
Advanced Unix Programming
Prof. N. B Venkateswarlu
No ratings yet
Introduction To Classes and Objects: Objectives
No ratings yet
Introduction To Classes and Objects: Objectives
38 pages
Xampp: Manas Bapna
No ratings yet
Xampp: Manas Bapna
11 pages
Object-Oriented Software Development - Tutorial 4
No ratings yet
Object-Oriented Software Development - Tutorial 4
20 pages
Control Statements: Part 2: Objectives
No ratings yet
Control Statements: Part 2: Objectives
32 pages
Defogger
No ratings yet
Defogger
56 pages
Control Statements: Part 1: Let's All Move One Place On
No ratings yet
Control Statements: Part 1: Let's All Move One Place On
40 pages
jhtp7LM 13
No ratings yet
jhtp7LM 13
21 pages
VB 02
No ratings yet
VB 02
43 pages
VB 03
No ratings yet
VB 03
76 pages
BIS4222 CW Sem 2 02
No ratings yet
BIS4222 CW Sem 2 02
6 pages
BIS4222 Database Systems: Middlesex University
No ratings yet
BIS4222 Database Systems: Middlesex University
7 pages
VB 07
No ratings yet
VB 07
38 pages
ATC and AIS Aspects
No ratings yet
ATC and AIS Aspects
12 pages
UNIT 6 - Privacy
No ratings yet
UNIT 6 - Privacy
20 pages
Priority Queues
No ratings yet
Priority Queues
25 pages
Data Structures and Algorithms: 1 Algorithm Analysis and Recursion
No ratings yet
Data Structures and Algorithms: 1 Algorithm Analysis and Recursion
37 pages
UNIT 7 - The Social Implications of Artificial Intelligence and Expert Systems
No ratings yet
UNIT 7 - The Social Implications of Artificial Intelligence and Expert Systems
11 pages
UNIT 9 - Computers in The Workplace
No ratings yet
UNIT 9 - Computers in The Workplace
13 pages
UNIT 8 - Failed Computer Software and Information System Projects
No ratings yet
UNIT 8 - Failed Computer Software and Information System Projects
14 pages
Unit: Advanced Programming
No ratings yet
Unit: Advanced Programming
4 pages
UNIT 10 - Evaluating The BCS Code of Conduct and The ACM Code of Ethics and Professional Conduct
No ratings yet
UNIT 10 - Evaluating The BCS Code of Conduct and The ACM Code of Ethics and Professional Conduct
16 pages
UNIT 3 - Intellectual Property Rights: HE Orld Ntellectual Roperty Rganisation Ichard Pinello Omputer Thicist
No ratings yet
UNIT 3 - Intellectual Property Rights: HE Orld Ntellectual Roperty Rganisation Ichard Pinello Omputer Thicist
16 pages
UNIT 4 - Computer Crime: Omputer Heft Nauthorised SE Ccess Odification HE Omputer Isuse CT
No ratings yet
UNIT 4 - Computer Crime: Omputer Heft Nauthorised SE Ccess Odification HE Omputer Isuse CT
9 pages
UNIT 5 - Computers and Democracy (Freedom of Speech Debate)
No ratings yet
UNIT 5 - Computers and Democracy (Freedom of Speech Debate)
16 pages
UNIT 11 - Enabling and Disabling Effects of Information Technology
No ratings yet
UNIT 11 - Enabling and Disabling Effects of Information Technology
9 pages
UNIT 2 - Unauthorised Access (Computer Hacking) : Ineteen Ighty OUR
No ratings yet
UNIT 2 - Unauthorised Access (Computer Hacking) : Ineteen Ighty OUR
12 pages
Cs 402 Analysis
No ratings yet
Cs 402 Analysis
4 pages
Greedy Algorithm For Matroids
No ratings yet
Greedy Algorithm For Matroids
15 pages
Hashing: Data Structures and Algorithms in Java
No ratings yet
Hashing: Data Structures and Algorithms in Java
78 pages
Mid-Term Examination (100 Points) Open Textbook and Open Notes Only Winter 2003
No ratings yet
Mid-Term Examination (100 Points) Open Textbook and Open Notes Only Winter 2003
5 pages
DSA - Searching PDF
100% (1)
DSA - Searching PDF
50 pages
Unit 2 - Searching & Game Playing
No ratings yet
Unit 2 - Searching & Game Playing
28 pages
19MID0068 - Adv Algo - ETH DA-1
No ratings yet
19MID0068 - Adv Algo - ETH DA-1
14 pages
C Blind Search
No ratings yet
C Blind Search
63 pages
Module 5 Hashing
No ratings yet
Module 5 Hashing
66 pages
Cs PPT CHP 3 Part 4
No ratings yet
Cs PPT CHP 3 Part 4
37 pages
Pickup and Delivery Problem With Time Windows
No ratings yet
Pickup and Delivery Problem With Time Windows
15 pages
Slides - Chapter 32 - String Matching
No ratings yet
Slides - Chapter 32 - String Matching
18 pages
Tugas OR II - 1810631140003 - Rahman Saleh - Kelas A
No ratings yet
Tugas OR II - 1810631140003 - Rahman Saleh - Kelas A
39 pages
A Computational Study of The Pseudoflow and Push-Relabel Algorithms For The Maximum Flow Problem PDF
No ratings yet
A Computational Study of The Pseudoflow and Push-Relabel Algorithms For The Maximum Flow Problem PDF
19 pages
Sha 256
100% (1)
Sha 256
199 pages
Deleted 31421 - Aritificial Intelligence Cheat Sheet
No ratings yet
Deleted 31421 - Aritificial Intelligence Cheat Sheet
2 pages
03 InformedHeuristicSearch
No ratings yet
03 InformedHeuristicSearch
80 pages
Unit-1-Hashing Mcqs
No ratings yet
Unit-1-Hashing Mcqs
39 pages
Multiple Path Pruning, Iterative Deepening and IDA : Alan Mackworth UBC CS 322 - Search 7 January 23, 2013
No ratings yet
Multiple Path Pruning, Iterative Deepening and IDA : Alan Mackworth UBC CS 322 - Search 7 January 23, 2013
37 pages
DS Assignment PDF
No ratings yet
DS Assignment PDF
6 pages
Lecture#8 - String Matching Algorithm
No ratings yet
Lecture#8 - String Matching Algorithm
38 pages
Finding Optimal Path: By: Dr. Anjali Diwan
No ratings yet
Finding Optimal Path: By: Dr. Anjali Diwan
40 pages
Big O Sheet
No ratings yet
Big O Sheet
2 pages
SHA-512 Overall Structure
No ratings yet
SHA-512 Overall Structure
5 pages
AI Sheet 3 - Problem Solving As Search (Heuristic Search - Adversarial Search) PDF
No ratings yet
AI Sheet 3 - Problem Solving As Search (Heuristic Search - Adversarial Search) PDF
2 pages
Search Algorithms in AI
No ratings yet
Search Algorithms in AI
7 pages
Bubble Sort
No ratings yet
Bubble Sort
11 pages
Test3 CSE205 - K18FR
No ratings yet
Test3 CSE205 - K18FR
7 pages
Rabin-Karp Algorithm
No ratings yet
Rabin-Karp Algorithm
2 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Hashing

Uploaded by

Hashing

Uploaded by

HASHING

10 Mod 10 = 0 (1 *10 + 0 = 10) (No remainder)

From these two functions:

Problem: Assume that we have records about exactly 10 people.

The ID numbers of the 10 people are respectively 0,1,2,3,4,5,6,7,8,9.

ID number Name Tel no

The Hash function h(key) is simply the trivial function:

h(key) = key = Row number in the array

The advantages of this method are:

The disadvantage of this method is:

NOT SO TRIVIAL EXAMPLE

What if the ID numbers of the people were: 0, 1, 2, 3, 4, 5, 6, 7, 8, 999

We choose the same method of storage giving;

ID number Name Tel no

The advantage of this method is:

The disadvantage of this method is that the:

A SIMPLE HASH FUNCTION

As before we wish to store the following 4 records in a table:

Now we choose a more complex Hash function: h = (key) MOD 11

To insert the 1st record: 81 John 12345

So we insert the record in position (or row) 4 of the table

h = (64) Mod 11 = 9 Thus this record is inserted in position 9.

To insert the next record: 39 Mary 34567

h = (39) Mod 11 = 6 This record is inserted in position 6 of the table

Let us consider what has happened here:

Lets continue the example:

Insert: 20 Joe 1342

We attempt to insert this record into position 9. We cant because it is full.

Now insert the record: 31 Zoe 1328

We attempt to insert this record into position 9. We cant because it is full.

The advantages of this method are:

The disadvantages of the method are:

Hash = ( key Mod 11+ i i=0,1,2…) Mod 11

Empty After 51 After 62 After 74 After 19 After 73

Empty After 51 After 62 After 74 After 19 After 73 After

HASHING with OPEN ADDRESSING -- QUADRATIC PROBING

The next strategy is to use f(i) = i2

Hash = (h(x) + f(i) i=0,1,2…) Mod (Table-size)

Hash = ( key Mod 11 + i2 i=0,1,2…) Mod 11

Empty After After After After After After After After

Advantages of the method:

Disadvantages of the method:

HASHING with OPEN ADDRESSING -- DOUBLE HASHING

The next strategy is to use f(i) = i * h2(x)

Hash = (h(x) + f(i) i=0,1,2…) Mod (Table-size) becomes

Hash = ( h1(x) + i * h2(x) i=0,1,2…) Mod (Table-size)

Key h1(x) h2(x) = 7 – (key mod 7) h1(x) + i * h2(x) i=1,2,3..

Advantages of the method:

Disadvantages of the method:

The process of re-hashing is simple but time consuming.

Old hash: h1 = key Mod 5

Insert 24, 14, 34

Initially After 24 After 14 After 34

Create a new 11 slot table.

key h2 = key Mod 11

Initially After 14 After 34 After 24

Hash h = Key Mod 11 for a table-size of 11.

The advantages of the method are:

ANALYSIS OF SEPARATE CHAINING

Assume a Table size of M

Average number of probes for a SUCCESSFUL search = 1 + x/2

Average number of probes for a SUCCESSFUL search = 1 + x/2 = 1 + .5/2 = 1.25

Average number of probes for a SUCCESSFUL search = 1 + x/2 = 1 + 2/2 = 2

EVALUATION OF HASH FUNCTIONS

There are 3 issues:

The Hash is h = (Mike) mod Table-size

It can also be written as:

This formulation is better.

H =( ( (k 3 * 128) + k2 ) *128 + k1 ) * 128) + k0

Clearly this is the efficient way of performing the computation.

H =( ( (k 3 * 37) + k2 ) * 37 + k1 ) * 37) + k0 for 4 character keys.

H =( ( (k 3 * 32) + k2 ) * 32 + k1 ) * 32) + k0 for 4 character keys.

SOME PRACTICAL HASH FUNCTIONS

Assume that a name is to be used as the key.

“Michael Brown” will be used as the example

Use the first 5 characters of the name: Micha

H =( ( (k 3 * 128) + k2 ) 128 + k1 ) 128) + k0