Mastering Hash Maps: The Key to Efficient Data Storage

Mastering Hash Maps: The Key to Efficient Data Storage

A hash map is arguably one of the most important data structures. Let's start by defining what a hash map is:

A hash map is a data structure that is used to store a collection of key-value pair for efficient retrieval.

While I was researching on hash maps, I came across this example from reddit:

Think of it like a valet parker who's retrieving you car. When they parked you car they used an algorithm to hash your license plate number that would tell them which space to park it. When you come to get your car, you tell them your license plate number and they use the same algorithm to figure out where that car would have been parked. No need to walk up and down the rows of cars until they found it.

Looking at that example, you can see how important a hash table can be. Generally, the time complexity for retrieving an element is O(1). In the worst case, it will be O(n), and I will explain how this can happen as we go.

Hash Function

Just like in our earlier example, a hash map requires a hash function that takes a key (usually a string) and converts it to an integer, known as a hash value, which we can use to index and retrieve our element. There are many hash functions available, but in our case, we will be using the FNV-1a (Fowler-Noll-Vo) 32-bit variant. This function helps minimize collisions as much as possible while being relatively simple to implement.

Collisions

There are cases where certain keys will share the same hash values; this is known as a collision, and we will need to handle these cases. In a hash map, we want as few collisions as possible. This depends on the hash function we are using and the table size, as you will see later on. To handle collisions, we will use a technique called separate chaining. If you aren't familiar with linked lists, I recommend going through my guide on Linked List

Coding Starts Here

Our goal is to store a phonebook and efficiently retrieve an entry using the user's name as a key. We’ll make it generic so you can also test it with your own data if you wish.

Objectives:

  • Implement the FNV-1a hash function.
  • Print our hash table.
  • Create a new entry.
  • Retrieve an entry.
  • Delete an entry.
  • Calculate our collision rate for testing.
  • Use the data from our phonebook file as entries for our hash map.

Implement the FNV-1a hash function.

#include <ctype.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #define FNV_OFFSET_BASIS 2166136261u // FNV-1 Constant value used in our function #define FNV_PRIME 16777619u // FNV-1 Constant value used in our function #define TABLE_SIZE 10000 // about 1 million entries // generic entry, so our data is a void pointer typedef struct entry { void *data; char* key; struct entry *next; // since we are using separate chaining, we need to point to the next entry if there is a collision } entry; u_int32_t hash_func(const char *key) { const u_int32_t len = strlen(key); u_int32_t hash = FNV_OFFSET_BASIS; for (u_int32_t i = 0; i < len; i++) { hash ^= (u_int8_t)key[i]; hash *= FNV_PRIME; } return hash; }

First, we get the length of our key using strlen, then we set our initial hash value as FNV_OFFSET_BASIS. Next, we iterate over each character in the string, first performing an XOR operation, followed by a multiplication. If you aren’t familiar with bitwise operations, you can check out my short and straightforward post on Bitwise Operations.

Print our hash table.

void print_table(void (print_data_func)(void *)) { for (int i = 0; i < TABLE_SIZE; i++) { const entry *entry = table[i]; if (entry == NULL) { continue; } printf("%d: ", i); while (entry != NULL) { print_data_func(entry->data); entry = entry->next; } printf("\n"); } }

We iterate over our table, retrieving the current entry in each iteration. We skip if it is NULL; if not, we call our generic function to print the data, then set the entry to the next pointer if there is a collision.

Create a new entry.

entry *create_entry(void *data, u_int32_t (hash_func)(const char*), char* key) { const int index = hash_func(key) % TABLE_SIZE; entry *new = malloc(sizeof(*new)); new->data = data; new->key = key; new->next = NULL; if (table[index] != NULL) { new->next = table[index]; } table[index] = new; return new; }

First, we use the hash_function to get our index. Then, we use malloc to allocate memory for our new entry. If there is a collision, we set the next pointer of our new entry to point to the already existing nodes.

Retrieve an entry

entry *get_entry(const char *key, u_int32_t (hash_func)(const char *)) { const int index = hash_func(key) % TABLE_SIZE; entry *temp = table[index]; while (temp != NULL && strcmp(temp->key, key) != 0) { temp = temp->next; } if (temp == NULL) { fprintf(stderr, "%s not found in the table\n", key); } return temp; }

Retrieving an entry can be as simple as using the index — entry *temp = table[index]; — which has a time complexity of O(1) if there is no collision. However, if there is a collision, the time complexity can be O(n) because we need to iterate through each next node to find the key we are looking for.

Even though an entry may have multiple collisions, the average time complexity for retrieving an entry in a hash map is generally O(1), with the worst case being O(n).

Delete an entry

void delete_entry(const char *key, u_int32_t (hash_func)(const char *)) { const int index = hash_func(key) % TABLE_SIZE; entry *temp = table[index]; entry *prev = NULL; while (temp != NULL && strcmp(temp->key, key) != 0) { prev = temp; temp = temp->next; } if (prev == NULL) { table[index] = temp->next; } else if (temp != NULL) { prev->next = temp->next; table[index] = prev; } else { fprintf(stderr, "%s not found in the table\n", key); } }

Just like retrieving an entry, deleting an entry can be O(1) if there is no collision. However, if there is a collision, it becomes O(n).

Another important thing to note is that deleting or searching for an element in an array is also O(n), where n is the number of elements. Depending on the number of elements or collisions (in the case of hash maps), one method may be better than the other. A general rule of thumb is to use an array for a smaller number of entries; otherwise, a hash map may be more effective.

Calculate our collision rate for testing

float collision_rate(void) { float collions = 0; float no_collions = 0; for (int i = 0; i < TABLE_SIZE; i++) { if (table[i] != NULL && table[i]->next != NULL) { collions++; } else if (table[i] != NULL && table[i]->next == NULL) { no_collions++; } } return collions / (collions + no_collions) * 100; }

This is a helper function that helps us track our collision rate. All we do is iterate over each entry, check for collisions, and increment the number of collisions. If there is no collision, we increment the number of no_collisions. Finally, we return the percentage of no collisions.

Use the data from our phonebook file as entries for our hash map

Lets tie it all together by using some test phonebook data for our hash map. Here is a list of phonebook data: https://github.com/cliffordaustin/HashMaps-in-C/blob/main/phonebook You can download or copy it.

// Helper function for trimming whitespace char *trimwhitespace(char *str) { // Trim leading space while(isspace((unsigned char)*str)) str++; if(*str == 0) // All spaces? return str; // Trim trailing space char *end = str + strlen(str) - 1; while(end > str && isspace((unsigned char)*end)) end--; // Write new null terminator character end[1] = '\0'; return str; } void init(void) { // Initializing our table with NULL entries for (int i = 0; i < TABLE_SIZE; i++) { table[i] = NULL; } } int main(const int argc, char **argv) { init(); if (argc != 2) { fprintf(stderr, "No filename"); return EXIT_FAILURE; } const char *filename = argv[1]; FILE *file = fopen(filename, "r"); if (file == NULL) { fprintf(stderr, "Failed to open file"); return EXIT_FAILURE; } // this is to hold the user name, which will also be our key char buffer[256]; while (fgets(buffer, sizeof(buffer), file) != NULL) { // our phonebook data looks like so: <name> - <number> // strchr get the string starting from the character specified, so in our case, that is the '-' // which is something like so: - <number> char *dash_position = strchr(buffer, '-'); if (dash_position != NULL) { // we remove the '-' by just setting the first, item which '-' to '\0'(null character). *dash_position = '\0'; // Then we get the phonenumber simply by just going one step forward the string literal. char *phone_number = dash_position + 1; // To make sure we are removing the whitespace that comes before the actual number, we just do a while loop until we get to a no whitespace character while (*phone_number == ' ') { phone_number++; } //We then trim any whitespace using that comes after the name of the phonenumber using our `trimwhitespace` helper function. trimwhitespace(buffer); trimwhitespace(phone_number); // we allocate memory for our phone phonebook *phone = malloc(sizeof(*phone)); // we are using `strdup` to creat a copy of our name and number string. // this is to make sure we are not referencing a pointer to our buffer, which changes for each loop. phone->name = strdup(buffer); phone->number = strdup(phone_number); // then we create an entry into our table using our `create_entry` function from earlier. create_entry(phone, hash_func, phone->name); } else { // basically invalid entry. printf("Dash not found in the input string.\n"); } } print_table(print_phonebook_data); const float rate = collision_rate(); printf("########################\n"); printf("Collision Rate = %.2f%%\n", rate); fclose(file); return 0; }

I have added as many comments as possible to help with understanding what is going on, so please go through them.

Finally lets run our code

First, make sure you have the phonebook data downloaded, ideally located in your working directory.

clang main.c -o main
./main ./phonebook

We now have a working hash map!!!

Something to try your hand at:

  • Instead of using phonebook data, try storing your own data.
  • Experiment with a different hash function.
  • Play around with the table size to see how it affects the collision rate.

Happy coding! 👋🏼

You can find the full code here: https://github.com/cliffordaustin/HashMaps-in-C