Hash Tables

6 November 2019

A hash table is a data structure that allows for fast retrieval of data. It does this by running the data through a hashing algorithm that calculates an index in which to store it in the underlying array. It has a lookup time of O(1) in most cases, and O(n) in the worst case.

Hashing Algorithms

A good hashing algorithm is fast (note: not for passwords!) and minimises collisions by having a good distribution of hash values.

For numeric keys, and where n is the number of available addresses (array elements), a common approach to calculate the index is with key % n

For alphanumeric keys, a numeric key is calculated by summing the ASCII codes and then getting the remainder. For example if our key was ABC, we could do (65 + 66 + 67) % n.

There is also the folding method, which involves splitting the key into equal parts, adding the values together and then getting the remainder by diving by some constant (which depends on the size of the table).

Collisions

Of course, by reducing data of arbitrary size down to a single number, there are going to be times where the hash function is going to generate the same index for different values. These are called collisions.

For example in an array that has a length of 10 and we use the modulo approach above, keys J and T are going to produce the same address of 4 from the hashing algorithm.

One method to resolve the collision is linear probing which is under the open addressing group. In this method, each index after the one already occupied is tried until one is available. So if J takes index 4, and K takes index 5, T which would normally be 4, would instead take index 6. If the end of the array is reached, then the method would cycle back to the start of the array and continue from there. Remember that lookups would also need to perform this.

The other open addressing methods are quadratic probing and double hashing, but explanations exceed the scope of this post.

There is also closed addressing, more commonly known as chaining. In this method, each array item is a pointer to the first node of a linked list (or another similar data structure). When the same index is calculated, a new node is created on that specific linked list. So with the same example, J would be the first node of index 4’s linked list, then T would be the second item.

As the more items to be populated into a hash table, the likelyhood of collisions also increases. As we now know that handling collisions can be slow for reads and writes, we could make the array larger than actually needed to reduce collisions. One downside to this is that the memory footprint is then larger. For most applications this would be fine, however.