Hashing
INTRODUCTION
Hashing is the term used for inserting a record into, or retrieving a record from, a table (or file) in
constant time.
In this chapter some commonly used terms are defined and hashing is explained in greater
detail. First a trivial example is worked through to set the scene. This example is then extended to
real situations. It is possible for the hash function to produce the same result for different
keys. This phenomenon is known as a collision. Four different solutions for resolving
collisions are discussed and compared.
The use of keys made up of characters is discussed, as are the computational issues that arise when
large keys are used, together with their solutions.
Finally some commonly used hash functions are described and discussed.
TERMS USED
KEY: The key is some data associated with the record that UNIQUELY identifies the
specific record. (For example, the record may consist of: Name, Address, ID number. The ID
number is UNIQUE (no 2 people have the same ID), thus it can be used as the key of the
record.)
X DIV Y: The DIV function returns the integer value equal to the number of times Y
divides into X. Any remainder is disregarded. (X and Y are integers.)
Eg 12 DIV 3 = 4 (Exact)
   13 DIV 3 = 4 (Remainder 1, which is disregarded)
   14 DIV 3 = 4 (Remainder 2, which is disregarded)
   15 DIV 3 = 5
X MOD Y: The MOD function returns the remainder after division of X by Y. (X and
Y are integers.)
Eg 27 Mod 10 = 7 (2 * 10 + 7 = 27)
   17 Mod 10 = 7 (1 * 10 + 7 = 17)
   36 Mod 10 = 6 (3 * 10 + 6 = 36)
   5 Mod 10 = 5 (0 * 10 + 5 = 5)
   30 Mod 10 = 0
   30 Mod 11 = 8
   30 Mod 13 = 4
   26 Mod 13 = 0
   131 Mod 13 = 1
X = (X DIV Y) * Y + (X MOD Y)
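The DIV and MOD examples above can be checked with a few lines of Python. This is only an illustrative sketch: for non-negative integers Python's `//` and `%` operators behave like DIV and MOD, and the helper name `div_mod_identity` is made up for this example.

```python
def div_mod_identity(x, y):
    """Return True if x == (x DIV y) * y + (x MOD y)."""
    return x == (x // y) * y + (x % y)

# DIV examples from the text
print(12 // 3)    # 4
print(13 // 3)    # 4 (remainder 1 is disregarded)

# MOD examples from the text
print(27 % 10)    # 7
print(36 % 10)    # 6
print(131 % 13)   # 1

# The identity X = (X DIV Y) * Y + (X MOD Y) holds for every pair tried here
print(all(div_mod_identity(x, y) for x in range(200) for y in range(1, 20)))  # True
```

Negative numbers are not considered in this chapter, so the sketch sticks to non-negative values.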
HASH FUNCTION h: h(key) = POSITION in which to place the record in the table.
This is a function that uses the key of the record to calculate the position in which to
place the record in the table.
COLLISION: The hash function may produce the same result for several different keys.
Collisions are then said to occur. These will be resolved a little later on.
TRIVIAL EXAMPLE
The ideal data structure is an array of records called TABLE as shown below:
In other words if the key is 3 then you will find the desired record in the 4th row of the
array (remember that arrays start at row 0).
This arrangement has two advantages:
1) we can insert, find or delete a record in CONSTANT time (it is just 1 operation);
and
2) the table is exactly the correct size (neither too big nor too small).
In this example it is not necessary to store the ID number because it is the same as the
row index of the table. Later on, in more complicated cases where there can be clashes, it
is essential to keep the key because we need to check that we have found the correct
record.
There are still only 10 students but their ID numbers range from 0 to 999.
1) The table size is much greater than the number of entries in the table. The table size
is 1000 and the number of entries is only 10. This means 99% of the table is
wasted space, which is very poor space utilization.
The challenge is to save space. This is achieved by using a more sophisticated Hash
function. More about this shortly.
In real life there are many problems that show the behaviour above, where the number of
entries in the table is far smaller than the largest key. Often, however, the entries
are more randomly distributed over the table. For example the 10 entries may have the
IDs: 2, 45, 67, 123, 345, 346, 459, 554, 721, 999.
ID (key)  Name      Tel No
81        John      12345
64        Sally     23456
39        Mary      34567
46        Philimon  87654
To insert the first record: 81 John 12345
Hash h = (81) Mod 11 = 4 (11 divides into 81 seven times (77) with a remainder of 4)
Position  ID  Name      Tel No
   0
   1
   2
   3
   4      81  John      12345
   5
   6
   7
   8
   9
  10
To insert the next record: 64 Sally 23456
h = (64) Mod 11 = 9
Position  ID  Name      Tel No
   0
   1
   2
   3
   4      81  John      12345
   5
   6
   7
   8
   9      64  Sally     23456
  10
To insert the next record: 39 Mary 34567
h = (39) Mod 11 = 6
Position  ID  Name      Tel No
   0
   1
   2
   3
   4      81  John      12345
   5
   6      39  Mary      34567
   7
   8
   9      64  Sally     23456
  10
To insert the next record: 46 Philimon 87654
h = (46) Mod 11 = 2
Position  ID  Name      Tel No
   0
   1
   2      46  Philimon  87654
   3
   4      81  John      12345
   5
   6      39  Mary      34567
   7
   8
   9      64  Sally     23456
  10
* IMPORTANT NOTE: From now on I will omit the Name & Tel No information from
the table. In practice it is there. This is just to save time and space in future examples.
Position  ID
   0
   1
   2      46
   3
   4      81
   5
   6      39
   7
   8
   9      64
  10
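The four insertions above can be reproduced with a short Python sketch. It assumes the table is held as a Python list of 11 slots with `None` marking an empty slot; the names `TABLE_SIZE`, `h` and `table` are chosen for this illustration only.

```python
TABLE_SIZE = 11

def h(key):
    """Hash function used in the example: key Mod 11."""
    return key % TABLE_SIZE

records = [
    (81, "John", "12345"),
    (64, "Sally", "23456"),
    (39, "Mary", "34567"),
    (46, "Philimon", "87654"),
]

table = [None] * TABLE_SIZE          # None marks an empty slot
for record in records:
    table[h(record[0])] = record     # these four keys happen not to collide

for position, record in enumerate(table):
    if record is not None:
        print(position, record)
# 2 (46, 'Philimon', '87654')
# 4 (81, 'John', '12345')
# 6 (39, 'Mary', '34567')
# 9 (64, 'Sally', '23456')
```

Note that the sketch simply overwrites a slot; it only works here because no two of these keys hash to the same position.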
Collisions
Life is not perfect with hashing. The hash function can produce the same position for
two or more records. Clearly two records cannot occupy the same position. The simplest solution when this
happens is just to try the next position in the table, & so on, until an empty slot is found.
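To insert a record with key 20:
h = (20) Mod 11 = 9
Position 9 is already occupied by 64, so the next position (10) is tried. It is empty, so the record is placed there.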
Position  ID
   0
   1
   2      46
   3
   4      81
   5
   6      39
   7
   8
   9      64
  10      20
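To insert a record with key 31:
h = (31) Mod 11 = 9
Position 9 is occupied by 64 and position 10 by 20, so the search goes beyond the end of the table and wraps around to position 0, which is empty.
The mathematical description of going beyond the end of the table & wrapping around to
the start of the table can be simply written as Mod (Table-size), or Mod 11 in this case.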
The final table is:
Position  ID
   0      31
   1
   2      46
   3
   4      81
   5
   6      39
   7
   8
   9      64
  10      20
HASHING with OPEN ADDRESSING -- LINEAR PROBING
What I have described is hashing with open addressing, with collisions resolved by
linear probing.
This can be written in a general form as:
    Hash = (h(x) + f(i)) Mod (Table-size), for i = 0, 1, 2, …
In the next example h(x) = key Mod 11, the Table-size = 11 and f(i) = i, giving:
    Hash = ((key Mod 11) + i) Mod 11, for i = 0, 1, 2, …
For illustration purposes 5 records are hashed into the table. These records have the keys:
51, 62, 74, 19, 73
Key Hash
51 7 OK
62 7 => (7 + 1) = 8 OK
74 8 => (8 + 1) = 9 OK
19 8 => 9 =>10 OK
73 7 => 8 => 9 => 10 =>11 = 0 OK
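The trace above can be reproduced with a small Python sketch of linear probing. It is a minimal illustration, assuming a list of 11 slots with `None` for empty; the function name `linear_probe_insert` is invented for this example.

```python
TABLE_SIZE = 11

def linear_probe_insert(table, key):
    """Insert key using (key Mod 11 + i) Mod 11 for i = 0, 1, 2, ...
    Returns the list of positions that were tried."""
    tried = []
    for i in range(TABLE_SIZE):
        position = (key % TABLE_SIZE + i) % TABLE_SIZE
        tried.append(position)
        if table[position] is None:      # empty slot found
            table[position] = key
            return tried
    raise RuntimeError("table is full")

table = [None] * TABLE_SIZE
for key in (51, 62, 74, 19, 73):
    print(key, linear_probe_insert(table, key))
# 51 [7]
# 62 [7, 8]
# 74 [8, 9]
# 19 [8, 9, 10]
# 73 [7, 8, 9, 10, 0]
```

The last line shows the wrap-around: after position 10 the Mod 11 brings the probe back to position 0.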
Now let's FIND record 73. We use exactly the same procedure as for inserting the record.
Key Hash
73 7 Is the key in position 7 = 73? (51<>73) No Try next position
8 Is the key in position 8 = 73? (62<>73) No Try next position
9 Is the key in position 9 = 73? (74<>73) No Try next position
10 Is the key in position 10 = 73? (19<>73) No Try next position
0 Is the key in position 0 = 73? (73=73) Yes Found
This example WORKS.
Now DELETE record 74. Position 9 becomes EMPTY; positions 7, 8, 10 and 0 still hold 51, 62, 19 and 73.
Let's try to FIND record 73. We use exactly the same procedure as for inserting the record.
Key Hash
73 7 Is the key in position 7 = 73 (51<>73) No Try next position
8 Is the key in position 8 = 73 (62<>73) No Try next position
9 Record EMPTY thus stop search => record NOT found. This is incorrect.
We need to mark this position as DELETED (rather than EMPTY) so that the search can continue on to position
10 and then 0, where the record is found.
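The idea of marking a slot as DELETED can be sketched in Python as follows. This is one possible arrangement, not the only one: `EMPTY` and `DELETED` are marker values invented for this sketch, and the function names are illustrative.

```python
TABLE_SIZE = 11
EMPTY = None
DELETED = object()        # hypothetical marker for a slot whose record was removed

def probe(key):
    """Yield the linear-probing positions for key, starting at key Mod 11."""
    start = key % TABLE_SIZE
    for i in range(TABLE_SIZE):
        yield (start + i) % TABLE_SIZE

def insert(table, key):
    for position in probe(key):
        if table[position] is EMPTY or table[position] is DELETED:
            table[position] = key
            return position
    raise RuntimeError("table is full")

def find(table, key):
    for position in probe(key):
        if table[position] is EMPTY:
            return None               # a truly empty slot ends the search
        if table[position] == key:    # a DELETED marker never equals a key
            return position
    return None

def delete(table, key):
    position = find(table, key)
    if position is not None:
        table[position] = DELETED     # mark, do not empty, so later searches continue
    return position

table = [EMPTY] * TABLE_SIZE
for key in (51, 62, 74, 19, 73):
    insert(table, key)
delete(table, 74)
print(find(table, 73))   # 0
print(find(table, 74))   # None
```

Because position 9 holds the DELETED marker rather than being empty, the search for 73 passes over it and carries on to positions 10 and 0.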
In quadratic probing we try 1^2 = 1 position away from the original hash, then 2^2 = 4 positions away, then 3^2 =
9 positions away & so on.
In the next example: h(x) = key Mod 11, the Table-size = 11 and f(i) = i^2,
which gives:
    Hash = ((key Mod 11) + i^2) Mod 11, for i = 0, 1, 2, …
For illustration purposes 7 records are hashed into the table, followed by an attempt to insert an eighth. These records have the keys:
51, 62, 74, 19, 73, 84, 95, 106
Key Hash
51   7 OK
62   7 => (7 + 1^2) = 8 OK
74   8 => (8 + 1^2) = 9 OK
19   8 => (8 + 1^2) = 9 => (8 + 2^2) = 12 Mod(11) = 1 OK
73   7 => (7 + 1^2) = 8 => (7 + 2^2) = 11 Mod(11) = 0 OK
84   7 => (7 + 1^2) = 8 => (7 + 2^2) = 11 Mod(11) = 0 => (7 + 3^2) = 16 Mod(11) = 5 OK
95   7 => (7 + 1^2) = 8 => (7 + 2^2) = 11 Mod(11) = 0 => (7 + 3^2) = 16 Mod(11) = 5
       => (7 + 4^2) = 23 Mod(11) = 1 => (7 + 5^2) = 32 Mod(11) = 10 OK
106  7 Now try it yourself. You will discover that while there are still empty
slots in the table, the probe sequence won't find them: POTENTIAL DISASTER! Don't
worry, there is a solution.
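The failure for key 106 can be seen by running the example in Python. The sketch below is only an illustration, assuming a list of 11 slots with `None` for empty; the `max_tries` parameter is an invented cut-off so that the loop stops when no slot is found.

```python
TABLE_SIZE = 11

def quadratic_probe_insert(table, key, max_tries=TABLE_SIZE):
    """Try positions (key Mod 11 + i^2) Mod 11 for i = 0, 1, 2, ...
    Returns (position, tried); position is None if no empty slot was found."""
    tried = []
    for i in range(max_tries):
        position = (key % TABLE_SIZE + i * i) % TABLE_SIZE
        tried.append(position)
        if table[position] is None:
            table[position] = key
            return position, tried
    return None, tried

table = [None] * TABLE_SIZE
for key in (51, 62, 74, 19, 73, 84, 95):
    quadratic_probe_insert(table, key)

position, tried = quadratic_probe_insert(table, 106, max_tries=100)
empty_slots = [p for p in range(TABLE_SIZE) if table[p] is None]
print(position)              # None: no slot found even after 100 tries
print(sorted(set(tried)))    # [0, 1, 5, 7, 8, 10]
print(empty_slots)           # [2, 3, 4, 6]
```

Only six distinct positions are ever visited from the starting position 7, and all six are occupied, so the four empty slots are never reached.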
3) While quadratic probing is better than linear probing it still has a disadvantage.
Every record that hashes to the same position (62, 73, 84, 95) follows exactly
the same path away from the original position occupied by 51. As more records
hash to this position, the collision resolution process takes longer to find an
empty slot. This is known as secondary clustering.
In double hashing, when we have a collision we use a second hash function to determine the distance away
from the first position at which to try to place the record. If that fails we move that distance again, &
so on, i.e.
We try at:
h1(x)
h1(x) + h2(x)
h1(x) + 2 * h2(x)
h1(x) + 3 * h2(x); and so on
To ensure that this method works the following conditions must hold:
1) h2(x) must NOT be ZERO (if it were ever zero then no other positions would ever
be tried).
2) All cells must be capable of being tried. The following formula for h2(x) is a
good one:
h2(x) = R - (x mod R) where R is a prime number less than the Table-size
In the next example: h1(x) = key Mod 11, the Table-size = 11 and
h2(x) = 7 – (key mod 7)
The following 6 records with keys: 38, 1, 16, 49, 11, 60 will be hashed to the table.
Position  Empty  After  After  After  After  After  After
                    38      1     16     49     11     60
   0                                            11     11
   1                        1      1      1      1      1
   2
   3                                                   60
   4
   5                38     38     38     38     38     38
   6
   7
   8                                     49     49     49
   9
  10                              16     16     16     16
In quadratic probing all clashes at a position follow exactly the same path away
from this position. In double hashing DIFFERENT paths away from the clash
position are followed. This reduces the length of path that has to be traversed
before an empty slot is found. Naturally not every clash follows a different path.
Some will follow the same path as others.
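The double hashing example with h1(x) = key Mod 11 and h2(x) = 7 - (key mod 7) can be checked with the following Python sketch. It assumes a list of 11 slots with `None` for empty; the function and variable names are chosen for this illustration.

```python
TABLE_SIZE = 11
R = 7                      # a prime smaller than the table size

def h1(key):
    return key % TABLE_SIZE

def h2(key):
    return R - (key % R)   # always between 1 and R, so never zero

def double_hash_insert(table, key):
    """Try h1, h1 + h2, h1 + 2*h2, ... (all Mod 11) until an empty slot is found."""
    step = h2(key)
    for i in range(TABLE_SIZE):
        position = (h1(key) + i * step) % TABLE_SIZE
        if table[position] is None:
            table[position] = key
            return position
    raise RuntimeError("no empty slot found")

table = [None] * TABLE_SIZE
placed = {key: double_hash_insert(table, key) for key in (38, 1, 16, 49, 11, 60)}
print(placed)   # {38: 5, 1: 1, 16: 10, 49: 8, 11: 0, 60: 3}
```

Keys 16, 49 and 60 all clash at position 5, but their step sizes (5, 7 and 3) differ, so each follows its own path away from the clash.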
REHASHING
Rehashing is the process that needs to be done when the table of entries gets too full.
Normally one rehashes once the table becomes 50% full. The reason for this is that a)
when linear probing is used the number of attempts required to find an empty slot
increases dramatically as the table fills; and b) with quadratic probing the method is only
guaranteed to work if the table is less than 50% filled (and the table size is a prime number).
2) Go through all the entries in the old table and rehash each one to the new table. In this
example the old hash was h1 = key Mod 5 for the table of size 5. For the new table of size
11 the new hash would be h2 = key Mod 11.
Example:
New 11 slot table after rehashing completed
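The rehashing step described above, from a table of size 5 to a table of size 11, can be sketched in Python as follows. The keys 13, 24 and 6 are made up for illustration, and linear probing is assumed for resolving collisions; the function names are also invented for this sketch.

```python
def insert(table, key):
    """Linear-probing insert using key Mod (table size)."""
    size = len(table)
    for i in range(size):
        position = (key % size + i) % size
        if table[position] is None:
            table[position] = key
            return position
    raise RuntimeError("table is full")

def rehash(old_table, new_size):
    """Create a new, larger table and re-insert every entry of the old one."""
    new_table = [None] * new_size
    for key in old_table:
        if key is not None:
            insert(new_table, key)    # uses key Mod new_size, not the old hash
    return new_table

old_table = [None] * 5
for key in (13, 24, 6):               # hypothetical keys; the table is now more than 50% full
    insert(old_table, key)
print(old_table)    # [None, 6, None, 13, 24]

new_table = rehash(old_table, 11)
print(new_table)    # [None, None, 13, 24, None, None, 6, None, None, None, None]
```

Every key has to be hashed again because its position depends on the table size: 13 sat in position 3 of the old table but lands in position 2 of the new one.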
SEPARATE CHAINING
A final method of resolving collisions is to use a table with one entry for every possible
hash value. When a collision occurs a linked list is created with the new record being
inserted at the end of that particular list.
Example:
Hash h = Key Mod 11
Key  Name  Hash
22   Joe   0
26   Pam   4
39   Zulu  6
48   Henk  4
17   Bob   6
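The example above can be sketched in Python as follows. As a simplification, ordinary Python lists stand in for the linked lists, and the names `chain_insert` and `chain_find` are invented for this illustration.

```python
TABLE_SIZE = 11

def chain_insert(table, key, name):
    """Append the record to the end of the list for its hash value."""
    table[key % TABLE_SIZE].append((key, name))

def chain_find(table, key):
    """Walk along the list for this hash value until the key is found."""
    for stored_key, name in table[key % TABLE_SIZE]:
        if stored_key == key:
            return name
    return None

table = [[] for _ in range(TABLE_SIZE)]   # one (initially empty) list per hash value
for key, name in [(22, "Joe"), (26, "Pam"), (39, "Zulu"), (48, "Henk"), (17, "Bob")]:
    chain_insert(table, key, name)

print(table[0])               # [(22, 'Joe')]
print(table[4])               # [(26, 'Pam'), (48, 'Henk')]
print(table[6])               # [(39, 'Zulu'), (17, 'Bob')]
print(chain_find(table, 48))  # Henk
```

Finding Henk means first stepping past Pam in the list at position 4, which is where the search time comes from as the lists grow.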
The disadvantages of the method are:
a. Space has to be allocated for the pointer within each record.
b. It can take considerable time to find the required record.
Example 1:
See example above:
Table size = M = 10
No. entries = N = 5
Load factor = x = N/M = 5/10 = 0.5 (Average length of each linked list)
Example 2:
Table size = M = 10
No. entries = N = 20
Load factor = x = N/M = 20/10 = 2 (Average length of each linked list)
Evaluation of Characters
Up till now a simple hash function has been used where the key has always been a
numeric value. Often we wish to use a name, or part of a name, as the key, e.g. Mike.
We evaluate the string “Mike” by replacing each character by its numeric value.
In the ASCII character set there are 128 characters: 52 letters, 10 digits, special characters (+, - etc)
and control characters like NULL. The numeric values of the letters are:
A = 65 J = 74 S = 83 a = 97 j = 106 s = 115
B = 66 K = 75 T = 84 b = 98 k = 107 t = 116
C = 67 L = 76 U = 85 c = 99 l = 108 u = 117
D = 68 M = 77 V = 86 d = 100 m = 109 v = 118
E = 69 N = 78 W = 87 e = 101 n = 110 w = 119
F = 70 O = 79 X = 88 f = 102 o = 111 x = 120
G = 71 P = 80 Y = 89 g = 103 p = 112 y = 121
H = 72 Q = 81 Z = 90 h = 104 q = 113 z = 122
I = 73 R = 82 i = 105 r = 114
Thus “Mike” = M * 128^3 + i * 128^2 + k * 128^1 + e * 128^0
= 77 * 128^3 + 105 * 128^2 + 107 * 128^1 + 101 * 128^0
= 77 * 128 * 128 * 128 + 105 * 128 * 128 + 107 * 128 + 101
= 77 * 2097152 + 1720320 + 13696 + 101
= 161480704 + 1720320 + 13696 + 101
= 163 214 821
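The calculation above can be verified with a few lines of Python. This is a direct, unoptimized sketch; `ord` gives the ASCII value of a character and the function name `radix128_value` is made up for this example.

```python
def radix128_value(text):
    """Treat the characters of text as digits of a base-128 number."""
    total = 0
    for power, ch in enumerate(reversed(text)):
        total += ord(ch) * 128 ** power   # rightmost character has power 0
    return total

print([ord(ch) for ch in "Mike"])   # [77, 105, 107, 101]
print(radix128_value("Mike"))       # 163214821
```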
Now this is quite a big number. Will it fit into a computer word?
If the computer word is 32 bits long the maximum (unsigned) integer it can hold is:
2^32 - 1 = 4 294 967 295
So our hash value will fit in (but if we added a 5th character it would not).
If the computer word is 16 bits long the maximum (unsigned) integer it can hold is:
2^16 - 1 = 65536 - 1 = 65535
This is far too small. Most of the contribution from the “M” and the “i” will be lost in
overflow. If the key has more characters in it the situation gets worse. Effectively the
leading characters play only a very small part in the total because they are lost in
overflow.
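The effect of overflow can be imitated in Python, whose integers never overflow, by keeping only the lowest bits of the value. This sketch assumes unsigned words that simply discard the high bits on overflow; the function names are invented, and “Nike” is an extra key added purely for comparison.

```python
def radix128_value(text):
    total = 0
    for ch in text:
        total = total * 128 + ord(ch)
    return total

def truncate(value, bits):
    """Keep only the lowest `bits` bits, as an unsigned word of that size would."""
    return value & ((1 << bits) - 1)

mike = radix128_value("Mike")
print(mike <= 2**32 - 1)                      # True: fits in a 32-bit word
print(radix128_value("Mikes") <= 2**32 - 1)   # False: a 5th character overflows 32 bits
print(truncate(mike, 16))                     # 30181
print(truncate(radix128_value("Nike"), 16))   # 30181
```

In a 16-bit word “Mike” and “Nike” give the same value: the leading character has been lost completely.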
Speed of Evaluation
As you can see there are quite a number of additions and multiplications to be done in
this calculation. There is a simple way to minimize these.
H = k2 * 128^2 + k1 * 128^1 + k0
written out in full is
H = k2 * 128 * 128 + k1 * 128 + k0 which requires 3 multiplications and 2 additions.
It can be rewritten as
H = (k2 * 128 + k1) * 128 + k0 which requires only 2 multiplications and 2 additions.
This formulation becomes more efficient as the number of characters in the key increases.
It is an application of what is known as Horner's rule.
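Horner's rule turns into a very short loop. The sketch below is illustrative (the function names are made up) and compares it with the term-by-term evaluation to show that both give the same value.

```python
def horner_value(text, base=128):
    """Evaluate the key with one multiplication and one addition per character."""
    total = 0
    for ch in text:
        total = total * base + ord(ch)
    return total

def power_value(text, base=128):
    """Term-by-term evaluation, computing each power of the base separately."""
    n = len(text)
    return sum(ord(ch) * base ** (n - 1 - i) for i, ch in enumerate(text))

print(horner_value("Mike"))                          # 163214821
print(horner_value("Mike") == power_value("Mike"))   # True
```

The loop multiplies the starting total of zero as well, so it performs one multiplication per character; skipping that first step gives the count of one fewer multiplication than characters.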
Overflow Problem
In the previous section the example of using “Mike” as a key was illustrated. The value
obtained for the hash was very big. It was however small enough to fit into a 32 bit word.
If a key of “Michael” had been used then there would have been a large amount of
overflow. Such a key should however spread the records more evenly over the entire
table.
There are several ways to reduce (or eliminate) overflow. Some of these will now be
described.
Method 1:
Just use UPPER case letters (26 + blank) and the 10 digits. This gives 37 characters in
total rather than 128. Consequently the total of the hash will be smaller and the problem
will be diminished.
Method 2:
Just reduce the multiplicative factor. In this instance 37 is reduced to 32 and the Hash
becomes:
Clearly the result will be smaller & more likely to fit in. Additionally multiplication by
32 can be done by shifting rather than by multiplication. This is much faster.
The above formulation will work because exactly the same calculation is done to insert a
record as is done to find a record. Using 32 rather than 37 as the multiplicative factor will
certainly compute slightly different hash positions for the same key. Using the 37 will
probably distribute the keys better. Using the 32 will still work, will be faster and will
retain a larger contribution from the characters on the left hand side of the key.
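The effect of changing the multiplicative factor from 37 to 32 can be tried out with the sketch below. It assumes, for simplicity, that each character contributes its ASCII value (`ord`) rather than a value in the range 0 to 36; the function names are invented for this illustration.

```python
def factor_hash(text, factor):
    """Horner-style evaluation with an arbitrary multiplicative factor."""
    total = 0
    for ch in text:
        total = total * factor + ord(ch)
    return total

def shift_hash(text):
    """Same as factor_hash(text, 32), using a left shift by 5 bits instead of a multiplication."""
    total = 0
    for ch in text:
        total = (total << 5) + ord(ch)
    return total

key = "MICHAEL"
print(shift_hash(key) == factor_hash(key, 32))         # True: shifting by 5 multiplies by 32
print(factor_hash(key, 32) < factor_hash(key, 37))     # True: the smaller factor gives a smaller total
print(shift_hash(key) % 11, factor_hash(key, 37) % 11) # the two table positions need not agree
```

Whichever factor is chosen, the same function must be used both to insert and to find a record.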
HASH 1:
This choice will work. If, however, there are many names that are the same it will lead to
many clashes.
HASH 2:
This choice will be better as consonants occur less frequently than vowels so the final
characters chosen should be more random.
HASH 3:
Use the first 3 characters, excluding vowels, of the first name together with the first 2
characters, excluding vowels, of the surname: MchBr
This choice should be even better because of the randomness of the consonants as well as
the randomness of first names and surnames.
HASH 4:
Use the first 3 characters, excluding vowels, of the surname together with the first 3
characters of the first name. If there are not enough consonants or characters use an “x”:
BrwMic
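HASH 3 and HASH 4 can be sketched in Python as below. The name “Michael Brown” is an assumption, inferred from the keys MchBr and BrwMic; the function names are invented, and only a, e, i, o, u are treated as vowels.

```python
VOWELS = set("aeiouAEIOU")

def first_consonants(name, count):
    """Return up to `count` leading letters of name, skipping vowels."""
    letters = [ch for ch in name if ch.isalpha() and ch not in VOWELS]
    return "".join(letters[:count])

def hash3_key(first_name, surname):
    # 3 consonants of the first name + 2 consonants of the surname
    return first_consonants(first_name, 3) + first_consonants(surname, 2)

def hash4_key(first_name, surname):
    # 3 consonants of the surname + first 3 characters of the first name,
    # padded with "x" when there are not enough characters
    return first_consonants(surname, 3).ljust(3, "x") + first_name[:3].ljust(3, "x")

print(hash3_key("Michael", "Brown"))   # MchBr
print(hash4_key("Michael", "Brown"))   # BrwMic
print(hash4_key("Al", "Lee"))          # LxxAlx
```

The resulting string would then be evaluated character by character, as described earlier, to obtain the position in the table.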
In general it is up to the designer of the hash function to find a function that spreads the data
to be used as evenly as possible over the table with as few collisions as possible. The
hash functions given above are a few examples. Each person can have fun designing their
own hash function. A word of warning, however, when you invent your own: run some
tests with your special function & see how many collisions you get for some subset of
your data. Then do the same for one of the above hash functions & see which is best.
QUESTIONS:
ANSWERS
1 HASHING is a method that uses a function and the key of the record to find a
position to place the record in a table { h(key) = POSITION }. The record must
be inserted, found or removed in CONSTANT time. Generally the range of the
keys is much greater than the size of the table.
2 COLLISION The hash function may produce the same result for several
different keys. A collision is then said to occur.
5 LONG KEYS
ADVANTAGES : Give a good distribution of keys over the entire table. This
means that the records are evenly distributed over the entire table.
DISADVANTAGES: It takes longer to calculate the hash function. Overflow
problems can occur.
PROBLEMS
1) Insert the following records into a table of size 17 using the hash function
h = key mod 17. Linear probing is to be used to resolve collisions. The keys of the
records are: 2, 5, 14, 31, 15, 32, 48, 49
2) Insert the following records into a table of size 13 using the hash function
h = key mod 13. Quadratic probing is to be used to resolve collisions. The keys of the
records are: 1, 2, 5, 14, 10, 23
3) Insert the following records into a table of size 13 using the hash function
h = key mod 13. Double hashing is to be used to resolve collisions. The second hash
function is : h2 = 1 + (key Mod 5).
The keys of the records are: 1, 24, 8, 10, 12, 21, 47, 34
SOLUTIONS
Problem 2:
Keys: 1, 2, 5, 14, 10, 23
Problem3:
1 1 Mod 13 = 1 OK
24 24 Mod 13 = 11 OK
8 8 Mod 13 = 8 OK
10 10 Mod 13 = 10 OK
12 12 Mod 13 = 12 OK
21 21 Mod 13 = 8 Full 1 + 21 Mod 5 = 1 + 1 = 2
8 + 2 = 10 Full
8 + 2 * 2 = 12 Full
8 + 3 * 2 = 14 Mod 13 = 1 Full
8 + 4 * 2 = 16 Mod 13 = 3 OK
47 47 Mod 13 = 8 Full 1 + 47 Mod 5 = 1 + 2 = 3
8 + 3 = 11 Full
8 + 2 * 3 = 14 Mod 13 = 1 Full
8 + 3 * 3 = 17 Mod 13 = 4 OK
34 34 Mod 13 = 8 Full 1 + 34 Mod 5 = 1 + 4 = 5
8 + 5 = 13 Mod 13 = 0 OK