0% found this document useful (0 votes)
164 views

Hashing

Hashing is a technique for inserting or retrieving records from a table in constant time by using a hash function to map a key to a specific position in the table. Collisions occur when the hash function produces the same result for different keys. There are four main solutions to resolve collisions: (1) open addressing, which involves probing to the next empty slot if a collision occurs; (2) separate chaining, which uses linked lists at each table position to handle collisions; (3) linear probing, which sequentially searches the next table positions if a collision occurs; and (4) quadratic probing, which searches positions that are farther apart if a collision occurs. The size of the hash table and choice of hash function are important
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
164 views

Hashing

Hashing is a technique for inserting or retrieving records from a table in constant time by using a hash function to map a key to a specific position in the table. Collisions occur when the hash function produces the same result for different keys. There are four main solutions to resolve collisions: (1) open addressing, which involves probing to the next empty slot if a collision occurs; (2) separate chaining, which uses linked lists at each table position to handle collisions; (3) linear probing, which sequentially searches the next table positions if a collision occurs; and (4) quadratic probing, which searches positions that are farther apart if a collision occurs. The size of the hash table and choice of hash function are important
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

HASHING

INTRODUCTION

Hashing is the term used for inserting or retrieving a record into a table (or file) in
constant time.

In this Chapter some commonly used terms are defined. Hashing is explained in greater
detail. Firstly a trivial example is done to set the scene. This example is then extended to
real situations. It is possible for the Hash function to produce the same result for different
keys. This phenomenon is known as a collision. Four different solutions for resolving
collisions are discussed and compared.

Using keys comprising characters is discussed. Also computational issues that arise when
large keys are used are discussed and solutions given

Finally some commonly used hash functions are described and discussed.

TERMS USED

KEY: The key is some data associated with the record that UNIQUELY identifies the
specific record. ( ie The record may consist of: Name, Address, ID number. The ID
number is UNIQUE (no 2 people have the same ID) thus it can be used as the key of the
record).

X DIV Y: The DIV function returns the integer value equal to the number of times Y
divides into X. Any remainder is disreguarded. (X and Y are integers).

Eg 12 DIV 3 = 4 (Exact)
13 DIV 3 = 4 (Remainder 1, which is disreguarded)
14 DIV 3 = 4 (Remainder 2, which is disreguarded)
15 DIV 3 = 5

X MOD Y: The MOD function returns the remainder after division of X by Y. (X and
Y are integers).

Eg 27 Mod 10 = 7 (2 * 10 + 7 = 27)
17 Mod 10 = 7
36 Mod 10 = 6 (3 * 10 + 6 = 26)
5 Mod 10 = 5 (0 * 10 + 5 = 5)

10 Mod 10 = 0 (1 *10 + 0 = 10) (No remainder)

1
30 Mod 10 = 0 30 Mod 11 = 8
30 Mod 13 = 4 26 Mod 13 = 0 131 Mod 13 = 1

From these two functions:

X = (X DIV Y) * Y + (X MOD Y)

HASH FUNCTION h: h(key) = POSITION in which to place the record in the table.
This is a function, that uses the key of the record, to calculate the position in which to
place the record in the table.

COLLISION: The hash function may produce the same result for several different keys.
Collisions are then said to occur. These will be resolved a little later on.

TRIVIAL EXAMPLE

Problem: Assume that we have records about exactly 10 people.


Each record is comprised of: ID Number, Name and Telephone Number.

The ID numbers of the 10 people are respectively 0,1,2,3,4,5,6,7,8,9.

The ideal data structure is an array of records called TABLE as shown below:

ID number Name Tel no


0 Mike 2341
1 Pam 2378
2 John 1289
3 Sally 3478
4 Tom 3478
5 Peter 2378
6 Colin 7691
7 Janet 1479
8 Mary 8643
9 Pam 9876

The Hash function h(key) is simply the trivial function:

h(key) = key = Row number in the array

In other words if the key is 3 then you will find the desired record in the 4th row of the
array (remember that arrays start at row 0).

The advantages of this method are:

2
1) we can insert, find or delete a record in CONSTANT time (it is just 1 operation);
and
2) the table is exactly the correct size (neither too big nor too small).

The disadvantage of this method is:


1) if we wish to add an 11th person it’s a hassle. We would have to define a new
array of size 11. We would have to re-hash all the records from our old table to
the new one. Finally we could add the 11th record to the new table.

In this example it is not necessary to store the ID number because it is the same as the
row index of the table. Later on, in more complicated cases where there can be clashes, it
is essential to keep the key because we need to check that we have found the correct
record.

NOT SO TRIVIAL EXAMPLE

What if the ID numbers of the people were: 0, 1, 2, 3, 4, 5, 6, 7, 8, 999

There are still only 10 students but their ID numbers range from 0 to 999.

We choose the same method of storage giving;

ID number Name Tel no


0 Mike 2341
1 Pam 2378
2 John 1289
3 Sally 3478
4 Tom 3478
5 Peter 2378
6 Colin 7691
7 Janet 1479
8 Mary 8643
… … …
… … …
… … …
999 Pam 9876

The advantage of this method is:


1) we can insert, find or delete a record in CONSTANT time.

The disadvantage of this method is that the:

3
1) Table size is much greater than the number of entries in the table. The Table size
is 1000 and the number of entries is only 10. This means 99% of the table is
wasted space which is very poor space utilization

The challenge is to save space. This is achieved by using a more sophisticated Hash
function. More about this shortly.

In real life there are many problems that show the behaviour above where the number of
entries in the table is far smaller than the maximum entry key. Often however the entries
are more randomly distributed over the table. For example the 10 entries may have the
ID’s of: 2, 45, 67, 123, 345, 346, 459, 554, 721, 999 .

A SIMPLE HASH FUNCTION

As before we wish to store the following 4 records in a table:

ID Name Tel No
key
81 John 12345
64 Sally 23456
39 Mary 34567
46 Philimon 87654

Now we choose a more complex Hash function: h = (key) MOD 11

To insert the 1st record: 81 John 12345

Hash h = (81) Mod 11 = 4 (11 divides into 81 seven times (77) with a remainder of 4)

So we insert the record in position (or row) 4 of the table

ID Name Tel No
0
1
2
3
4 81 John 12345
5
6
7
8
9
10
Position Key

4
To insert the next record: 64 Sally 23456

h = (64) Mod 11 = 9 Thus this record is inserted in position 9.

ID Name Tel No
0
1
2
3
4 81 John 12345
5
6
7
8
9 64 Sally 23456
10
Position Key

To insert the next record: 39 Mary 34567

h = (39) Mod 11 = 6 This record is inserted in position 6 of the table

ID Name Tel No
0
1
2
3
4 81 John 12345
5
6 39 Mary 34567
7
8
9 64 Sally 23456
10
Position Key

5
To insert the next record: 46 Philimon 87654

h = (46) Mod 11 = 2

ID Name Tel No
0
1
2 46 Philimon 87654
3
4 81 John 12345
5
6 39 Mary 34567
7
8
9 64 Sally 23456
10
Position Key

Let us consider what has happened here:


1) The Hash function for each entry was calculated and the record inserted in the
table in that position;
2) This example specifically contained no collisions. In other words each entry
hashed to a unique position. This is artificial and I will show next what to do
when collisions occur;
3) I chose a table size at least double the number of records to be entered. This is
to ensure that the method works relatively efficiently;
4) The table size was specifically chosen as a prime number. In later examples
the method is only guaranteed to work when the table size is a prime.

* IMPORTANT NOTE: From now on I will omit the Name & Tel No information from
the table. In practice it is there. This is just to save time and space in future examples

ID
0
1
2 46
3
4 81
5
6 39
7
8
9 64
10
Position Key

6
Collisions

Life is not perfect with Hashing. The Hashing function can produce the same position to
place two or more records. Clearly this is unacceptable. The simplest solution when this
happens is just to try the next position in the table & so on until an empty slot is found.

Lets continue the example:

Insert: 20 Joe 1342

h = (20) Mod 11 = 9

We attempt to insert this record into position 9. We cant because it is full.


Add 1 to the position i.e. position 10 & try there. Fine it is empty so insert the record
there (Note I am omitting the Name & Tel No from the description).

ID
0
1
2 46
3
4 81
5
6 39
7
8
9 64
10 20
Position Key

Now insert the record: 31 Zoe 1328

h = (31) Mod 11 = 9

We attempt to insert this record into position 9. We cant because it is full.


Add 1 to the position i.e. position 10 & try there. No good it is full.
Add 1 to the position i.e. position 11. Well position 11 does not exist. It is off the end of
the table. So try position 0 at the start of the table instead. It is empty so insert the record.
there. (If it was full you would try position 1 & so on until you find an empty position).

The mathematical description of going beyond the end of the table & wrapping around to
the start to the table can be simply written as Mod (Table-size) or Mod 11 in this case.

7
The final table is:

ID
0 31
1
2 46
3
4 81
5
6 39
7
8
9 64
10 20
Position Key

The advantages of this method are:


1) The method is simple to implement; and
2) As long as there is an empty slot in the table a new record can be inserted.

The disadvantages of the method are:


1) It can be inefficient. It often occurs that a large blocks of fully occupied cells
form inside the table. This is called Primary Clustering. When a new record
is hashed to some value in this full cluster it takes many attempts before an
empty slot can be found. This is illustrated in the next example.
2) If the table is 50% full then it can be shown that an empty slot can be found
after 2.5 attempts. If the table is 90% full then some 50 attempts are required
to find an empty slot.
3) To keep the method efficient we must keep the table at most half full. Once
this happens a new table must be set up. Its size needs to be the 1st prime
number that is at least double the old table size. Then we must re-hash all the
records in the old table into the new table (This is a lot of work). Then we can
carry on.
4) Deletion problem. This occurs when a record is deleted from the table and this
record is on the path to some other record that was subsequently inserted. The
search hits an empty slot, stops and reports that the record is not there. In fact
it is but the method can not find it. The solution to this problem is to mark
each position as EMPTY, FULL or DELETED. When we a searching for a
record if the position is marked as DELETED the search continues. The
search only stops when the record is found or a EMPTY slot is found. This is
illustrated in the next example.

8
HASHING with OPEN ADDRESSING -- LINEAR PROBING

What I have described is Hashing with open addressing with collisions resolved by
Linear probing.

This can be written in a general form as: Hash = (h(x) + f(i) i=0,1,2…) Mod (Table-size)

In the next example: h(x) = key Mod 11, the Table-size = 11 and f(i) = i, giving

Hash = ( key Mod 11+ i i=0,1,2…) Mod 11

For illustration purposes 5 records are hashed into the table. These records have the keys:
51, 62, 74, 19 ,73

Key Hash
51 7 OK
62 7 => (7 + 1) = 8 OK
74 8 => (8 + 1) = 9 OK
19 8 => 9 =>10 OK
73 7 => 8 => 9 => 10 =>11 = 0 OK

Empty After 51 After 62 After 74 After 19 After 73


0 73
1
2
3
4
5
6
7 51 51 51 51 51
8 62 62 62 62
9 74 74 74
10 19 19

Now lets FIND record 73. We use exactly the same procedure as inserting the record.
Key Hash
73 7 Is the key in position 7 = 73 (51<>73) No Try next position
8 Is the key in position 8 = 73 (62<>73) No Try next position
9 Is the key in position 9 = 73 (74<>73) No Try next position
10 Is the key in position 10 = 73 (19<>73) No Try next position
0 Is the key in position 0 = 73 (67<>73) Yes Found
Right this example WORKS.

9
Now : DELETE record 74. The resulting table is:

Empty After 51 After 62 After 74 After 19 After 73 After


74
Deleted
0 73 73
1
2
3
4
5
6
7 51 51 51 51 51 51
8 62 62 62 62 62
9
10 19 19 19

Lets try to FIND record 73. We use exactly the same procedure as inserting the record.

Key Hash
73 7 Is the key in position 7 = 73 (51<>73) No Try next position
8 Is the key in position 8 = 73 (62<>73) No Try next position
9 Record EMPTY thus stop search => record NOT found. This is incorrect.

WE needed to mark this record as DELETED so the search can continue on to position
10 then 0 where the record is found.

HASHING with OPEN ADDRESSING -- QUADRATIC PROBING

The next strategy is to use f(i) = i2

We try 12 =1 position away from the original hash, then 22 = 4 positions away, then 32 =
9 positions away & so on.

Hash = (h(x) + f(i) i=0,1,2…) Mod (Table-size)

In the next example: h(x) = key Mod 11, the Table-size = 11 and f(i) = i2
Which gives;

Hash = ( key Mod 11 + i2 i=0,1,2…) Mod 11

10
For illustration purposes 6 records are hashed into the table. These records have the keys:
51, 62, 74, 19, 73

Key Hash
51 7 OK
62 7 => (7 + 12) = 8 OK
74 8 => (8 + 12) = 9 OK
19 8 => (8 + 1 ) = 9 => (8 + 22) = 12 Mod(11) = 1
2
OK
73 7 => (7 + 12) = 8 => (7 + 22) = 11 Mod(11) = 0 OK
2 2 2
84 7 => (7 + 1 ) = 8 => (7 + 2 ) = 11 Mod(11) = 0 => (7 + 3 ) = 16 Mod(11) =5 OK
95 7 => (7 + 12) = 8 => (7 + 22) = 11 Mod(11) = 0 => (7 + 32) = 16 Mod(11) =5
=> (7 + 42) = 23 Mod(11) = 1 => (7 + 52) = 32 Mod(11) = 10 OK
106 7 Now try it yourself – you will discover that while there are still empty
slots in the table the hash function wont find them – POTENTIAL DISASTER! Don’t
worry there is a solution.

Empty After After After After After After After After


51 62 74 19 73 77 84 95
0 73 73 73 73
1 19 19 19 19 19
2
3
4
5 84 84
6
7 51 51 15 51 51 51 51 51
8 62 62 62 62 62 62 62
9 74 74 74 74 74 74
10 95

Advantages of the method:


1) Primary clustering is reduced & effectively eliminated.

Disadvantages of the method:


1) the calculation to do the hash is a bit more complex;
2) the table can still have empty slots but the hash function & collision resolution
does not find them. This problem can be solved in a simple fashion. The table size
must be a prime number and the table MUST be kept less than half full. When
these two conditions are met it can be proved that a record will be placed in the
table. Once this happens a new table must be set up. Its size needs to be the 1st
prime number that is at least double the old table size. Then we must re-hash all
the records in the old table into the new table (This is a lot of work). Then we can
carry on inserting further values.

11
3) While quadratic probing is better than linear probing it still has a disadvantage.
Every record that hashes to the same position ( 62, 73, 84 ,95 ) all follow exactly
the same path away from the original position occupied by 51. As more records t
hash to this position the longer the collision resolution process takes to find an
empty slot. This is known as secondary clustering

HASHING with OPEN ADDRESSING -- DOUBLE HASHING

The next strategy is to use f(i) = i * h2(x)

When we have a collision we use a second hash function to determine the distance away
from the first position to try to placethe record, if that fails we move that distance again &
so on i.e.
We try at:
h1(x)
h1(x) + h2(x)
h1(x) + 2 * h2(x)
h1(x) + 3 * h2(x); and so on

Hash = (h(x) + f(i) i=0,1,2…) Mod (Table-size) becomes

Hash = ( h1(x) + i * h2(x) i=0,1,2…) Mod (Table-size)

To ensure that this method words the following conditions must hold:
1) h2(x) must NOT be ZERO ( If it was ever zero then not other positions would ever
be tried)
2) All cells must be capable of being tried. The following formula for h2(x) is a
good one:
h2(x) = R – (x mod R) where R is a prime number less than the Table-size

In the next example: h1(x) = key Mod 11, the Table-size = 11 and
h2(x) = 7 – (key mod 7)

The following 6 records with keys: 38 , 1, 16 ,49, 11, 60 will be hashed to the table.

Key h1(x) h2(x) = 7 – (key mod 7) h1(x) + i * h2(x) i=1,2,3..


38 38 Mod 11 = 5 OK
1 1 Mod 11 = 1 OK
16 16 Mod 11 = 5 full 7- 16 Mod 7=7–2=5 Hash=5+5=10 OK
49 49 Mod 11 = 5 full 7- 49 Mod 7=7–0=7 Hash=5+7=12 Mod 11 = 1 full
Hash=5+14=19 Mod 11 =8 OK
11 11 Mod 11 = 0 OK
60 60 Mod 11 = 5 full 7- 60 Mod 7=7-4=3 Hash=5+3=8 full
Hash=5+2*3=5+6=11Mod 11 =0 full
Hash=5+3*3=5+9=14Mod 1 =3 OK

12
Empty After After After After After After
38 1 16 49 11 60
0 11 11
1 1 1 1 1 1
2
3 60
4
5 38 38 38 38 38 38
6
7
8 49 49 49
9
10 16 16 16 16

Advantages of the method:


1) Records with ID 16, 49, 60 all hashed to position 5 which clashed with where
record 38 had already been placed.
In the case of 16 a position 5 away was tried & was successful .
In the case of 49 a position 7 away then 14 away was tried
In the case of 60 a position 3, 6,then 9 away was tried until successful.

In quadratic hashing all clashes to a position follow exactly the same path away
from this position. In double hashing DIFFERENT paths away from the clash
position are followed. This reduces the length of path that has to be traversed
before an empty slot is found. Naturally not every clash follows a different path.
Some will follow the same path as others.

Disadvantages of the method:


1) the calculation to do the second hash is a bit more complex.

13
REHASHING

Rehashing is the process that needs to be done when the table of entries gets too full.
Normally one rehashes once the table becomes 50% full. The reason for this is that a)
when linear probing is used the number of attempts required to find an empty slot
increases dramatically as the table fills; and b) with quadratic probing the method is only
guaranteed to work is the table is less than 50% filled.

The process of re-hashing is simple but time consuming.


1) A new table is created that is at least double the size of the original table. The usual
choice for the new table size is the first prime number that is at least twice as big as the
old table size. In the example below the old table size was 5. The new table size is 11.
11 is the first prime that is at least double the old table size of 5

2) Go through all the entries in the old table and rehash each one to the new table. In this
example the old hash was h1 = key Mod 5 for the table of size 5. For the new table of size
11 the new hash would be h2 = key Mod 11.

Example:

Old hash: h1 = key Mod 5

Insert 24, 14, 34


key h1 = key Mod 5
24 24 Mod 5 = 4
14 14 Mod 5 = 4 full, try 5 Mod 5 = 0 OK
34 34 Mod 5 = 4 full, try 5 Mod 5 = 0 full, try 0 + 1 = 1 OK
But the table is now more that 50% full with 3 entries in a 5 slot table.

Initially After 24 After 14 After 34


inserted inserted inserted
0 14 14
1 34
2
3
4 24 24 24

Create a new 11 slot table.


Start at the top of the old table and re-hash each entry using h2 = key Mod 11

key h2 = key Mod 11


14 14 Mod 11 = 3 OK
34 34 Mod 11 = 1 OK
24 24 Mod 11 = 2 OK
As you can see the records hash to different positions in the new 11 slot table

14
New 11 slot table after rehashing completed

Initially After 14 After 34 After 24


inserted inserted inserted
0
1 34 34
2 24
3 14 14 14
4
5
6
7
8
9
10

SEPARATE CHAINING

A final method of resolving collisions is to use a table with one entry for every possible
hash value. When a collision occurs a linked list is created with the new record being
inserted at the end of that particular list.

Hash h = Key Mod 11 for a table-size of 11.

Example:
Hash h = Key Mod 11

Key Hash
22 Joe 0
26 Pam 4
39 Zulu 6
48 Henk 4
17 Bob 6

15
The disadvantage of the method is:
a. Space has to be allocated for the pointer within each record
b. It can take considerable time to find the required record.

The advantages of the method are:


c. The Table-size is constant. The lists from each entry grow as entries are
inserted into the table.
d. The number of entries in the table usually exceed the table size. This is
because several values collide and are put onto a list.

ANALYSIS OF SEPARATE CHAINING

Assume a Table size of M


Assume there are N entries.
Load factor x = N/M (=1 typically)

Average number of probes for a SUCCESSFUL search = 1 + x/2


Average number of probes for an UNSUCCESSFUL search = 1 + x

16
Example1:
See example above:
Table size = M = 10
No. entries = N = 5
Load factor = x = N/M = 5/10 = .5 (Average length of each linked list)

Average number of probes for a SUCCESSFUL search = 1 + x/2 = 1 + .5/2 = 1.25


Average number of probes for an UNSUCCESSFUL search = 1 + x = 1 + .5 = 1.5

Example 2:
Table size = M = 10
No. entries = N = 20
Load factor = x = N/M = 20/10 = 2 (Average length of each linked list)

Average number of probes for a SUCCESSFUL search = 1 + x/2 = 1 + 2/2 = 2


Average number of probes for an UNSUCCESSFUL search = 1 + x = 1 + 2 = 3

EVALUATION OF HASH FUNCTIONS

There are 3 issues:


1) Evaluation of characters;
2) Speed of evaluation; and
3) The Overflow problem.

Evaluation of Characters

Up till now a simple hash function has been used where the key has always been a
numeric value. Often we wish to use a name, or part of a name, as the key i.e. Mike.

The Hash is h = (Mike) mod Table-size

We evaluate the string “Mike” by replacing each character by its numeric value.

In the ASCII character set there are 128 characters. 52 letters, special characters +,- etc
and control characters like NULL. The numeric values of the characters are:

A = 65 J = 74 S = 83 a = 97 j = 106 s = 115
B = 66 K = 75 T = 84 b = 98 k = 107 t = 116
C = 67 L = 76 U = 85 c = 99 l = 108 u = 117
D = 68 M = 77 V = 86 d = 100 m = 109 v = 118
E = 69 N = 78 W = 87 e = 101 n = 110 w = 119
F = 70 O = 79 X = 88 f = 102 o = 111 x = 120
G = 71 P = 80 Y = 89 g = 103 p = 112 y = 121
H = 72 Q = 81 Z = 90 h = 104 q = 113 z = 122
I = 73 R = 82 i = 105 r = 114

17
Thus “ Mike” = M * 1283 + i * 1282 + k * 1281 + e * 1280
= 77 * 1283 + 105 * 1282 + 107 * 1281 + 101 * 1280
= 77 * 128 * 128 * 128 + 105 * 128 * 128 + 107 * 128 + 101
= 77 * 2097152 + 1720320 + 13696 + 101
= 161480704 + 1720320 + 13696 + 101
= 163 214 821
Now this is quite a big number. Will it fit into a computer word?
If the computer word is 32 bits long the maximum integer it can hold is:
232 –1 = 4 294 960 000 (approx)

So our hash value will fit in (But if we added a 5th character it would not).

If the computer word is 16 bits long the maximum integer it can hold is:
216 –1 = 65536 –1 = 65535

This is far to short. Most of the contribution from the “M” and the “i” will be lost in
overflow. If the key has more characters in it the situation gets worse. Effectively the
leading characters play only a very small part in the total because they are lost in
overflow.

Speed of Evaluation

As you can see there are quite a number of additions and multiplications to be done in
this calculation. There is a simple way to minimize these.

H = k2 * 1282 + k1 * 1281 + k0
Can be written as
H = k2 * 128 * 128 + k1 * 128 + k0 which requires 3 multiplications and 2 additions.

It can also be written as:


H = ( (k2 * 128) + k1 ) *128 + k0 which only requires 2 multiplications and 2 additions

This formulation is better.


If we had 4 constants then
H = k3 * 128 * 128 * 128 + k2 * 128 * 128 + k1 * 128 + k0
which requires 6 mltiplications and 3 additions.

H =( ( (k 3 * 128) + k2 ) *128 + k1 ) * 128) + k0


which only requires 4 multiplications and 3 additions.

This formulation becomes more efficient as the number of characters in the hash increase.
It is an application of what is known as Horner’s rule

Clearly this is the efficient way of performing the computation.

18
Overflow Problem

In the previous section the example of using “Mike” as a key was illustrated. The value
obtained for the hash was very big. It was however small enough to fit into a 32 bit word.
If a key of “Michael” had been used then there would have been a large amount of
overflow. Such a key should however spread the records more evenly over the entire
table.

There are several ways to reduce (or eliminate) overflow. Some of these will now be
described.

Method 1:
Just use UPPER case letters (26 + blank) and the 10 digits. This gives 37 characters in
total rather than 128. Consequently the total of the hash will be smaller and the problem
will be diminished.

H =( ( (k 3 * 37) + k2 ) * 37 + k1 ) * 37) + k0 for 4 character keys.

Method 2:

Just reduce the multiplicative factor. In this instance 37 is reduced to 32 and the Hash
becomes:

H =( ( (k 3 * 32) + k2 ) * 32 + k1 ) * 32) + k0 for 4 character keys.

Clearly the result will be smaller & more likely to fit in. Additionally multiplication by
32 can be done by shifting rather than by multiplication. This is much faster.

The above formulation will work because exactly the same calculation is done to insert a
record as is done to find a record. Using 32 rather than 37 as the multiplicative factor will
certainly compute slightly different hash positions for the same key. Using the 37 will
probably distribute the keys better. Using the 32 will still work, will be faster and will
retain a larger contribution from the characters on the left hand side of the key.

SOME PRACTICAL HASH FUNCTIONS

Assume that a name is to be used as the key.

“Michael Brown” will be used as the example

HASH 1:

Use the first 5 characters of the name: Micha

19
This choice will work. If however there are many names that are the same it will lead to
many clashes

HASH 2:

Use the first 5 characters, excluding vowels, of the name: MchlB

This choice will be better as consonants occur less frequently than vowels so the final
characters chosen should be more random.

HASH 3:

Use the first 3 characters, excluding vowels, of the first name together with the first 2
characters, excluding vowels, of the surname: MchBr

This choice should be even better because of the randomness of the consonants as well as
the randomness of first names and surnames.

HASH 4:

This hash is used by our specific university:

Use the first 3 characters, excluding vowels, of the surname together with the first 3
characters of the first name. If there are not enough consonants or characters use an “x”
BrwMic

In general it is up the designer of the hash function to find a function that spreads the data
to be used as evenly as possible over the table with as few collisions as possible. The
hash functions given above are a few examples. Each person can have fun designing their
own hash function. A word of warning however when you invent your own. Run some
test with your special function & see how many collisions you get for some subset of
your data. Then do the same for one of the above hash functions & see which is best.

QUESTIONS:

1 Explain what is ment by “hashing”.


2 What is ment by a collision?
3 What is primary clustering & why does it occur?
4 What is secondary clustering & why does it occur?
5 What are the advantages & disadvantages of using a long key?
6 What is the advantage of using double hashing as a collision resolution
method?

ANSWERS

20
1 HASHING is a method that uses a function and the key of the record to find a
position to place the record in a table { h(key) = POSITION }. The record must
be inserted, found or removed in CONSTANT time. Generally the range of the
keys is much greater than the size of the table.

2 COLLISION The hash function may produce the same result for several
different keys. A collisions is then said to occur.

3 PRIMARY CLUSTERING can occur when linear probing is used to resolve


collisions. This means that the next available position after the original one is
chosen to place the record. If many records collide they are d as close as possible
to the original position. This causes many cells close together to all be filled.

4 SECONDARY CLUSTERING occurs when quadratic probing is used to


resolve collisions. Exactly the same path away from the original position is used
(1, 4, 9, 16, 25, 36, etc) until an empty slot is found. This exact path is tried until
an empty slot is found. The more collisions there are the longer this path becomes.

5 LONG KEYS
ADVANTAGES : Give a good distribution of keys over the entire table. This
means that the records are evenly distributed over the entire table.
DISADVANTAGES: It takes longer to calculate the hash function. Overflow
problems can occur.

6 DOUBLE HASHING. In quadratic hashing all clashes to a position follow


exactly the same path away from this position. In double hashing DIFFERENT
paths away from the clash position are followed. This reduces the length of path
that has to be traversed before an empty slot is found. Naturally not every clash
follows a different path. Some will follow the same path as others.

PROBLEMS

1) Insert the following records into a table of size 17 using the hash function
h = key mod 17. Linear probing is to be used to resolve collisions. The keys of the
records are: 2, 5, 14, 31, 15, 32, 48, 49

2) Insert the following records into a table of size 13 using the hash function
h = key mod 13. Quadratic probing is to be used to resolve collisions. The keys of the
records are: 1, 2, 5, 14, 10, 23

3) Insert the following records into a table of size 13 using the hash function

21
h = key mod 13. Double hashing is to be used to resolve collisions. The second hash
function is : h2 = 1 + (key Mod 5).
The keys of the records are: 1, 24, 8, 10, 12, 21, 47, 34

SOLUTIONS

Problem1: h= key Mod 17

Key h = key mod 17


2 2 Mod 17 = 2 OK
5 5 Mod 17 = 5 OK
14 14 Mod 17 = 14 OK
31 31 Mod 17 = 14 =>15 OK
15 15 Mod 17 = 15=>16 OK
32 32 Mod 17 = 16=>17 = 0 OK
48 48 Mod 17 = 14=>15=>16=>17 = 0=>1 OK
49 49 Mod 17 = 15=>16=>17 = 0=>1=>2 =>3 OK

Problem2:

1, 2, 5,14, 10, 23

Key h = key Mod 13


1 1 Mod 13 = 1 OK
2 2 Mod 13 = 2 OK
5 5 Mod 13 = 5 OK
14 14 Mod 13 = 1 Full, 1 + 12 = 2 Full, 1 + 22=5 Full, 1 + 32=10 OK
10 10 Mod 13 = 10 Full, 10 + 12 = 11 OK
23 23 Mod 13 = 10 Full, 10 + 12 = 11 Full, 10 + 22=14 Mod 13 = 1 Full,
10 + 32=19 Mod 13 = 6 OK

Problem3:

h = key mod 13.


h2 = 1+(key Mod 5).
The keys of the records are: 1, 24, 8, 10, 12, 21, 47, 34

Key key Mod 13 1 + (key Mod 5).

1 1 Mod 13 = 1 OK
24 24 Mod 13 = 11 OK

22
8 8 Mod 13 = 8 OK
10 10 Mod 13 = 10 OK
12 12 Mod 13 = 12 OK
1 1 Mod 13 = 1 OK
21 21 Mod 13 = 8 Full 1 + 21 Mod 5 = 1 + 1 = 2
8 + 2 = 10 Full
8 + 2 * 2 = 12 Full
8 + 3 * 2 = 14 Mod 13 = 1 Full
8 + 4 * 2 = 16 Mod 13 = 3 OK
47 47 Mod 13 = 8 Full 1 + 47 Mod 5 = 1 + 2 = 3
8 + 3 = 11 Full
8 + 2 * 3 = 14 Mod 13 = 1 Full
8 + 3 * 3 = 17 Mod 13 = 4 OK
34 34 Mod 13 = 8 Full 1 + 34 Mod 5 = 1 + 4 = 5
8 + 5 = 13 Mod 13 = 0 OK

23
This document was created with Win2PDF available at http://www.daneprairie.com.
The unregistered version of Win2PDF is for evaluation or non-commercial use only.

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy