A hash table differs from the array that provides storage for it because store and retrieve operations are specified by a key value rather than by an index and the number of legal keys is larger than the number of cells in the array. The operations of interest are STORE(Key) and RETRIEVE(Key). It is possible to simply store the key and retrieve a Boolean function that indicates the presence or lack of presence of the key in the table. This is the simple way to discuss hash tables, but it is more likely that the key is indeed just a key in a record containing a Value field or a pointer (or index) that addresses some bulk of information. In a word, hash tables are apt to be exogenous--directories to data stored external to them.
The function HF1 of section 2.5 simply converts letters to integers and sums them. It exhibits two of the properties needed to form a hash table:
HF1 transforms keys into integers.
HF1 also exhibits an unhappy consequence of compression by hash functions:
HF1(BROTH) = HF1(OLIVE)
Compression causes collisions.
The address functions of section 2.3 transform a domain value that is an n-tuple of the form
[i1, i2, . . ., in]
into a range value that is a single index in a one-dimensional table. The correspondence is one-to-one, and so there is no compression; but compression can be introduced much as it was for HF1. For example, a function HF2 can be defined by
HF2([i1, i2, . . ., in]) = (i1 +![]()
![]()
+ in) MOD 25
If a key of any kind can be located in an n-dimensional table at
t = [i1, i2, . . ., in]
then HF2(t) provides the same service as for any n-dimensional table of keys as HF1 does for its specific data space.
Both HF1 and HF2 fail to define their range as a hash table structure by themselves because compression allows collisions: two elements of the domain can have identical range values. In any table structure, RETRIEVE must be the inverse of STORE, and so collisions must be resolved. With this in mind, the desired HASH TABLE structure consists of:
a domain
a range
a hash function h that exhibits compression
a resolution of the collisions caused by h
operations CREATE, STORE, and RETRIEVE
The domain is normally a given in a general sense and may be enormous. For example, it might be identifiers of up to 30 characters in length. The subset of the domain encountered in a specific application tends to be smaller but not well defined, and that makes a compressive system very attractive. CREATE requires no special attention. The range is a design decision that takes into account much of this section.
Suppose that a hash function is available that distributes the domain keys over the indices 0,1, . . . ,24 in a table HT with uniform probability. The first entry will not collide with any other. The probability that the second entry will not collide with a previous entry is 24/25. The probability that the third, fourth, . . . entries will not collide when entered is 23/25, 22/25, . . . (directly proportional to the number of free spaces). The probability that there will be no collisions for seven entries in a table of 25 cells is:
Hence the probability that there will be a collision is near 60 percent even though the table is less than one-third full. The cause is not really the small size of the table. (Figure out how likely a birthday collision is in a room containing 25 people.)
The ratio of n/m for n items in a table of size m is called the load factor, . In this example,
= 7/25 = 0.28. Clearly, the probability of a collision increases with
.
The probability that a table location will be accessed is affected by two things:
the selection of elements from the potential set (the domain)
the transformation of the selected elements by the hash function into range indices
h(z) = INT(m(zMOD 1))
To use the function h as a hash function first requires that the keys be turned into an integer in some efficient way. Keys are stored as binary strings in memory, and a binary string can be decoded as an integer (or sequence of integers) or converted with a utility function (such as ORD in Pascal). For the sake of illustration, consider the following simple approach applied to the ingredients of the recipe in section 2.5:
1. Ingredients are considered to be 12 characters in length, left-adjusted and padded with blank spaces.
2. Letters have as their index a two-digit value: A = 01, B = 02, . . ., Z = 26. Blanks have value 00. These values are found by a table lookup, which is much faster than calculation of them.
3. An ingredient is six four-digit integers. For example, BELLPEPPER is 0205, 1212, 1605, 1616, 0518, 0000.
4. The six integers for an ingredient are summed to form z. In functional terms, q(BELLPEPPER) = 5156 = z.
The calculation of q is much faster than it might appear to be. Let s[1 . . 12] be the string of characters of an ingredient. Then a calculation of z requires only one multiplication:
zs[1] + s[3] + s[5] + s[7] + s[9] + s[11]
zz X 100 + s[2] + s[4] + s[6] + s[8] + s[10] + s[12]
It is possible for z to be as large as 6 X 2626 = 15,756, but taking the sum MOD 10,000 restricts it to four digits (and more than that are unlikely anyway).
Now h(z) for BELLPEPPER, with m = 17 is:
h(z) = INT(m(zMOD 1))
= INT(17(5156MOD 1))
= INT(17(3186.5827 MOD 1))
= INT(17(0.5827))
= INT(9.9059)
= 9
For contrast, a second stage of multiplication by the prime base m can be introduced:
h2(z) = INT(m(m(zMOD 1) MOD 1))
= INT(17(9.9059 MOD 1))
= 15
Both h and h2 can be easily determined with a hand calculator or a small program as a check on how well they work on a particular data set. Multiplication by 100 can be replaced with multiplication by a power of 2 and hence--in assembly languages--by shifts. There are a large number of possible variations of this scheme.
The results, for the 13 ingredients (including BROTH) and prime bases 17, 31, and 47, are shown in Table B.1 .
m = 17 m = 31 m = 47
z h h2 h h2 h h2
-------------------------------------------
BEEF 0711 7 8 13 2 19 39
BELLPEPPER 5156 9 15 18 1 27 18
BLACKPEPPER 5342 9 2 16 20 25 11
BROTH 2538 9 11 17 20 26 37
CARROT 3639 0 7 0 24 1 8
CUMIN 3030 10 15 19 28 30 9
DILLWEED 2330 0 5 0 18 0 41
MUSHROOM 6557 7 10 13 27 21 3
OLIVE 2934 5 4 9 20 14 29
ONION 2443 14 9 26 17 40 12
POTATO 5631 2 9 4 18 6 46
SALT 3121 15 0 27 12 41 25
TOMATOPASTE 9352 14 8 26 13 40 4
Problem PB.1 is appropriate at this point.
The unavoidable conclusion is that any practical use of hash tables produces collisions. One way to implement collision resolution is a technique called linear probing.
Suppose a procedure Hash calculates an index k in a hash table HT[0 . . m - 1] for Key. If the table is empty, STORE(Key) would simply be HT[k] Key. If it is not, HT[k] may already contain a key, and a collision occurs. The first task of a STORE operation is thus to determine if HT[k] is empty. For that purpose HT might be initialized with a value that cannot be a key or associated with a flag array. For the recipe example and most identifiers, HT may be initialized to null, blank, or empty strings.
function Probe(Start,s) {O(m)
iStart
repeat
if HT[i] = s
then return i endif
i(i + l) MOD m
until i = Start
return - i
end {Probe
procedure StoreHT(Key) {O(m)
kHash(Key)
dexProbe(k,Blank)
if dex < 0
then {deal with a full table
else HT[dex]Key
endif
end {StoreHT
function RetrieveHT(Key) {O(m)
kHash(Key)
dexProbe(k,Key)
if dex < 0
then {report not there
else return dex
endif
end {RetrieveHT
m = 17
HT[7]BEEF 0 CARROT
HT[9]BELLPEPPER 1 DILLWEED
HT[9] collision 2 POTATO
HT[10]BLACKPEPPER
HT[9] collision
HT[10] collision 5 OLIVE
HT[11]BROTH
HT[0]CARROT 7 BEEF
HT[10] collision 8 MUSHROOM
HT[11] collision 9 BELLPEPPER
HT[12] CUMIN 10 BLACKPEPPER
HT[0] collision 11 BROTH
HT[1]DILLWEED 12 CUMIN
HT[7] collision
HT[8]MUSHROOM 14 ONION
HT[5]OLIVE 15 SALT
HT[14]ONION 16 TOMATOPASTE
HT[2]POTATO
HT[15]SALT
HT[14] collision
HT[15] collision
HT[16]TOMATOPASTE
Problem PB.2 is appropriate at this point.
Two things about the sequence of HT entries are worth noting:
0Hash(z) < m
0HashAgain(z) < m
Hash(z)HashAgain(z)
2051 + 2121 + 6051 + 6160 + 5180 + 0000 = 21,563
Or simply use 1 - in place of
, or use h2.
With q, Hash = h, HashAgain = h2, m = 31, and alphabetical entry order:
HT[13]BEEF m = 31
HT[18]BELLPEPPER 0 CARROT
HT[16]BLACKPEPPER
HT[17]BROTH
HT[0]CARROT
HT[19]CUMIN 4 POTAT0
HT[0] collision 5 DILLWEED
HT[18] collision
HT[5]DILLWEED (
= 18)
HT[13] collision 8 TOMATOPASTE
HT[9]MUSHROOM (
= 27) 9 MUSHROOM
HT[9] collision
HT[29]OLIVE (
= 20)
HT[26]ONION
HT[4]POTAT0 13 BEEF
HT[27]SALT
HT[26] collision
HT[8]TOMATOPASTE (
= 13) 16 BLACKPEPPER
17 BROTH
18 BELLPEPPER
19 CUMIN
26 ONION
27 SALT
29 OLIVE
Problem PB.3 is appropriate at this point.
An alternate implementation of hash tables that permits deletion is based on the linked lists described in Chapter 3, and it is to be found in section L.
Problems not immediate, but requiring no major extensions of the text material
PB.1 Construct a table of hashed indices like the one of section B. 1 for the calendar months.
PB.2 Construct the hash table for the calendar months entered in calendar order, into a hash table of prime base m = 17.
PB.3 Construct the double-hashing table for the calendar months in calendar order with Hash = h, HashAgain = h2, and m = 31.
Programs for trying it yourself
PGB.1 Write a program that accepts identifiers of up to 12 characters in length, converts them to an integer with q, and stores them in a hash table HT with the use of Hash = h, HashAgain = h2, and m = 47. The program should report the number of collisions, the average number of tests by Probe, and the loading factor.