We know that dictionaries are very important data structures. But did you know we can improve on our previous array- and list-based implementations? With a concept called *hashing*, we can create an excellent dictionary implementation that efficiently manages all of that linked memory.

This article is part of the Data Structures and Algorithms Series. If you missed the previous article, check that out first.

## Answers to the Daily Problem

To solidify our understanding of dictionaries, the Daily Problem focuses on utilizing the primary functions of dictionary objects:

> You are given the task of reading in `n` numbers and then printing them out in sorted order. Suppose you have access to a balanced dictionary data structure, which supports each of the operations search, insert, delete, minimum, maximum, successor, and predecessor in `O(log n)` time.

So basically it’s asking us to use some creative jiu-jitsu with dictionaries in various situations.

- Explain how you can use this dictionary to sort in `O(n log n)` time using only the following abstract operations: minimum, successor, insert, search.

Sorting in `O(n log n)` time is a pretty common worst case for good sorting algorithms. Here the bound comes from touching all `n` numbers, with each dictionary operation costing `O(log n)`. With these four functions available, we can sort with something like this:

The book does this slightly differently: first it adds all of the numbers to the dictionary, then it sorts them using the `successor()` function. I’m assuming we already have a populated dictionary that we can copy into a new one, starting with the `minimum()` item and inserting each one, *presorted*, into the new dictionary.

```
function sort(n) {
  // walk the old dictionary in order, starting from its minimum,
  // and insert each item into the new dictionary
  let min = n.minimum(), newDict = new Dictionary();
  while (min !== null) {
    newDict.insert(min);
    min = n.successor(min); // null once we pass the maximum
  }
  return newDict;
}
```

- Explain how you can use this dictionary to sort in `O(n log n)` time using only the following abstract operations: minimum, insert, delete, search.

In this case, we keep stripping the smallest item off the old dictionary of `n` numbers. Each time we grab the minimum, we insert it into the new dictionary and then delete it from the old, unsorted one.

This is actually a perfect example of what the book refers to as *recursive data structures*. When we remove one item from the dictionary, we are left with just a smaller dictionary. We can exploit this by recursively grabbing the smallest item and adding it to our new dictionary to create a sorted object.

```
function sort(n) {
  let min, newDict = new Dictionary();
  while (!n.isEmpty()) {
    min = n.minimum();
    newDict.insert(min);
    n.delete(min); // what remains is just a smaller dictionary
  }
  return newDict;
}
```
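To make the recursive-data-structure idea concrete, here’s a runnable sketch using a plain array as a stand-in for the dictionary (in the real balanced structure, each of `minimum`, `insert`, and `delete` would cost `O(log n)`; `remove` is named to avoid JavaScript’s reserved `delete`):

```
// Minimal stand-in "dictionary": a plain array with the operations
// the solution needs.
function minimum(dict) {
  return dict.length ? Math.min(...dict) : null;
}
function remove(dict, value) {
  dict.splice(dict.indexOf(value), 1);
}

// Recursive form of the strip-the-minimum sort: the dictionary
// minus its smallest element is just a smaller dictionary.
function sortRecursive(dict, sorted = []) {
  if (dict.length === 0) return sorted;
  const min = minimum(dict);
  remove(dict, min);
  sorted.push(min);
  return sortRecursive(dict, sorted);
}

console.log(sortRecursive([5, 1, 9, 3])); // [1, 3, 5, 9]
```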

- Explain how you can use this dictionary to sort in `O(n log n)` time using only the following abstract operations: insert and in-order traversal.

```
function sort(n) {
  let newDict = new Dictionary();
  for (let i = 0; i < n.length; i++) {
    newDict.insert(n[i]); // insert the value, not the index
  }
  console.log(newDict.inorderTraverse());
  return newDict;
}
```

This is the best I could think of; I actually don’t really understand why the book thought this was a good solution. The idea is to copy all of the items into the new dictionary, and then do the sorting by printing out the traversal of all items in order. This is more congruent with the rest of their sorting algorithms, whereby they insert everything first and then handle the sorting part. It isn’t really a separate sorting step so much as taking advantage of a dictionary backed by a Binary Search Tree, whose in-order traversal visits the keys in ascending order.
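To see why the in-order traversal comes out sorted, here’s a minimal binary search tree sketch (an assumed implementation, not the book’s): every insert keeps smaller values to the left, so visiting left subtree, node, right subtree yields ascending order.

```
class Node {
  constructor(value) {
    this.value = value;
    this.left = this.right = null;
  }
}

// Standard BST insert: smaller values go left, others go right.
function insert(root, value) {
  if (!root) return new Node(value);
  if (value < root.value) root.left = insert(root.left, value);
  else root.right = insert(root.right, value);
  return root;
}

// In-order: left subtree, then node, then right subtree.
function inorder(root, out = []) {
  if (!root) return out;
  inorder(root.left, out);
  out.push(root.value);
  inorder(root.right, out);
  return out;
}

let root = null;
for (const v of [5, 1, 9, 3]) root = insert(root, v);
console.log(inorder(root)); // [1, 3, 5, 9]
```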

## Arrays + Linked Lists = Hash Tables

While dictionaries can be implemented using arrays or linked lists, we can get significant efficiency improvements by implementing dictionaries using arrays *and* linked lists. The key to connecting the two is *hashing*.

A *hashing function* is just a way to translate something into a numeric representation. There is no single canonical hash function (examples include SHA-256, the BSD checksum, and the Rabin fingerprint); what matters is that **it’s cheap to calculate and evenly distributed**. The number from that function can then be used as the index in an array.

```
class HashTable {
  ENGLISH_ALPHABET_LENGTH = 26;
  TABLE_SIZE = 500;

  hash(str) {
    let sum = 0;
    for (let c = 0, len = str.length; c < len; c++) {
      // weight each character by its position so that
      // anagrams don't produce the same sum
      sum += Math.pow(this.ENGLISH_ALPHABET_LENGTH, len - (c + 1)) * str.charCodeAt(c);
    }
    return sum % this.TABLE_SIZE;
  }
}
```

So that’s a sample hashing function to translate strings into really big numbers. Why make the number so big? So there’s a high chance that the key we’re translating (the word that acts like an array index for our dictionary) maps to a unique number.

Now we can expand on our `HashTable` to start implementing our dictionary functions!

```
class HashTable {
  // assume we have the hashing function and constants
  // that we've already implemented
  constructor() {
    this.table = new Array(this.TABLE_SIZE);
  }

  search(item) {
    let list = this.table[this.hash(item.key)];
    return list ? list.search(item.value) : null;
  }

  insert(item) {
    let idx = this.hash(item.key);
    let list = this.table[idx];
    if (list) {
      list.insert(item.value);
    } else {
      list = new LinkedList();
      list.insert(item.value);
      this.table[idx] = list;
    }
  }

  delete(item) {
    let idx = this.hash(item.key);
    let list = this.table[idx];
    if (list) {
      // delete may have to relink neighboring nodes
      // when removing an item mid-list
      list.delete(item.value);
    }
  }
}
```
```

Now you can see how the two data structures make a hash table a great implementation of a dictionary! We get the fast lookup benefits of an `Array`, with the faster inserts/deletes and safety of a `LinkedList` in the event that multiple strings happen to map to the same hash value. That means **all of these operations run in expected constant O(1) time, and O(n) in the worst case**, the difference being hash collisions.

If we have no or few collisions, aided by an excellent hashing function, then the linked lists never hold more than one item, so search and modification are instant. If we have a bad hash function, or our luck is bad (*luck* meaning that we happen to choose words whose hashes all result in the same number), then we’re operating on one long linked list, which is no different from the `O(n)` runtimes we calculated for lists.
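A quick way to see the bad-hash-function case (a deliberately weak hash, not the one above): if we just sum character codes and ignore position, anagrams always collide and chain into the same bucket.

```
// A deliberately weak hash: sum of character codes, ignoring
// position. Anagrams always collide under this scheme.
function weakHash(str, tableSize) {
  let sum = 0;
  for (let i = 0; i < str.length; i++) sum += str.charCodeAt(i);
  return sum % tableSize;
}

// "listen" and "silent" have the same characters, so the same sum:
// both land in the same bucket and pile into one linked list.
console.log(weakHash("listen", 500) === weakHash("silent", 500)); // true
```

This is exactly why the positional weighting (`26^(len - (c + 1))`) in the earlier hash function matters.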

## Practical applications for hash functions and hash tables

With a `HashTable` implementation in our repertoire, we can start to identify use cases where this data structure serves us well. In general, **hash tables and hashing functions help speed up algorithms that would take O(n log n) or O(n^2) time and turn them into linear-time (O(n)) algorithms**.
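A tiny example of that speedup (using JavaScript’s built-in hash-based `Set` rather than our `HashTable`): finding a duplicate in a list. The nested-loop approach is O(n^2); hashing every value we’ve seen makes it O(n).

```
// O(n) duplicate detection: one pass, with O(1) expected
// membership checks against a hash-backed Set.
function hasDuplicate(values) {
  const seen = new Set();
  for (const v of values) {
    if (seen.has(v)) return true; // already hashed this value
    seen.add(v);
  }
  return false;
}

console.log(hasDuplicate([3, 1, 4, 1, 5])); // true
console.log(hasDuplicate([2, 7, 18])); // false
```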

### Pattern matching strings with Rabin–Karp

A *string* is an array of characters with the order preserved, so we can express things like words in a sentence. We can find patterns in our words (like substrings, palindromes, etc.) using searching algorithms that make use of a hashing function. One such algorithm is *Rabin–Karp*.

```
function rabinKarp(str, pattern) {
  let hashPattern = hash(pattern);
  let patternLength = pattern.length;
  for (let i = 0, len = str.length - patternLength; i <= len; i++) {
    let substring = str.substring(i, i + patternLength); // end index is exclusive
    let hashString = hash(substring);
    if (hashString === hashPattern) {
      // hashes can collide, so confirm with a direct comparison
      if (substring === pattern) {
        return i;
      }
    }
  }
  return -1; // no match found
}
```
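The loop above recomputes each window’s hash from scratch, which costs O(n·m) overall. The real Rabin–Karp speedup comes from a *rolling* hash: slide the window one character and update the hash in O(1). Here’s a sketch with an assumed base and modulus (31 and 1e9+7 are common choices), not the book’s exact code:

```
const BASE = 31n;
const MOD = 1000000007n; // BigInt arithmetic avoids overflow

// Return the hash of every window of the given length.
function rollingHashes(str, windowLen) {
  let hash = 0n;
  let power = 1n; // BASE^(windowLen - 1), used to drop the old char
  const hashes = [];
  for (let i = 0; i < windowLen; i++) {
    hash = (hash * BASE + BigInt(str.charCodeAt(i))) % MOD;
    if (i > 0) power = (power * BASE) % MOD;
  }
  hashes.push(hash);
  for (let i = windowLen; i < str.length; i++) {
    // drop the leftmost character, then pull in the new one
    hash = (hash + MOD - (BigInt(str.charCodeAt(i - windowLen)) * power) % MOD) % MOD;
    hash = (hash * BASE + BigInt(str.charCodeAt(i))) % MOD;
    hashes.push(hash);
  }
  return hashes;
}

// Windows of "abcab" with length 2: "ab", "bc", "ca", "ab".
// The first and last hashes match because the windows are equal.
console.log(rollingHashes("abcab", 2));
```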

The idea here is that we take advantage of the fact that our hash function returns the same well-distributed number for the same string every time. That means that **if a window of our string hashes to the same big number as our pattern, it’s probably a match**.

I say “probably” because, as we know from our hash table implementation, there is a slight possibility of hash collisions that could generate false positives. Luckily, this is exceedingly rare when our table size is large and a prime number (primes can only be divided by 1 and themselves, so keys that share common factors can’t all pile into the same few buckets, making collisions less likely).
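A quick way to see why the prime table size helps, using small illustrative numbers rather than our 500-slot table: keys that share a factor with the table size collapse into a few buckets, while a prime size spreads them out.

```
// Multiples of 10 all collapse into bucket 0 when the table size is
// 10, but land in distinct buckets when the size is the prime 11.
const keys = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];
const bucketsComposite = new Set(keys.map(k => k % 10));
const bucketsPrime = new Set(keys.map(k => k % 11));
console.log(bucketsComposite.size); // 1: every key in the same bucket
console.log(bucketsPrime.size); // 10: one bucket per key
```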

### Duplication

Since Rabin–Karp helps us detect patterns within a string, we can apply this algorithm to a variety of real-world scenarios that exploit the fact that matching hash values strongly suggest two strings are the same:

- **Unique web pages on the web.** If web pages are just one long document of words and links, then the entire text of the HTML document can be represented as one giant number (via hashing, of course). For crawlers like Google, rather than slowly comparing all of the words of a document to verify it is unique, it just has to check whether the hash value is already in its mega list of hashes. If it isn’t, we know we have a new web page to add to the crawler.
- **Plagiarism.** Copied text can be easy to detect because the hashes should be the same, just like with crawled web pages. The only problem is that adding a few filler words, some spaces, or different formatting… with a few small changes you can create a unique hash and make it impossible to compare one document to the next. Instead, we can leverage pattern matching by scanning the document in chunks and seeing if the vast majority of sub-hashes are the same. The downside is that this requires significantly more space to store and compare those hashes, but it is still cheaper than searching every single word between the two documents.
- **Save when changed.** Just as a hash function can numerically represent an entire HTML document, we can hash an entire file with a cryptographic hash to ensure that the document has not been modified by an unauthorized third party. Ever see messages from your laptop warning that “this third-party application has not been signed”? This is exactly what that is for! Developers publish a hash with their saved changes to prove they were the last ones to touch their programs. They then send those hashes either to app stores or directly to their customers, assuring them that their programs haven’t been tampered with by outside developers and can be safely run on other machines.
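The chunk-based plagiarism idea can be sketched like this (a hypothetical scheme, not from the book): split each document into overlapping word chunks, hash each chunk, and measure how many chunk hashes the documents share. For simplicity this sketch uses the chunk string itself as its “hash”; a real system would run each chunk through a hash function.

```
// Collect the set of overlapping word chunks in a document.
function chunkHashes(text, chunkSize = 3) {
  const words = text.toLowerCase().split(/\s+/);
  const hashes = new Set();
  for (let i = 0; i + chunkSize <= words.length; i++) {
    hashes.add(words.slice(i, i + chunkSize).join(" "));
  }
  return hashes;
}

// Fraction of one document's chunks that also appear in the other.
function overlapRatio(a, b) {
  const ha = chunkHashes(a);
  const hb = chunkHashes(b);
  let shared = 0;
  for (const h of ha) if (hb.has(h)) shared++;
  return shared / Math.max(ha.size, 1);
}

// Swapping one word still leaves 2 of 3 chunks shared.
console.log(overlapRatio(
  "the quick brown fox jumps",
  "the quick brown fox leaps"
));
```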

As you can see, hashing, whether with a table or as a simple function, is a very practical mechanism for dealing with human language data in the form of strings. Strings are the fundamental formal translation of words into arrays, and the hashes are the translators that turn these characters into numbers. This makes dealing with lots of string data very efficient when we hash them out into numeric data.

## The next Daily Problem (4-3)

This round, our Daily Problem comes directly from the book, with a preview of our next segment: sorting.

> Take a sequence of `2n` real numbers as input. Design an `O(n log n)` algorithm that partitions the numbers into `n` pairs, with the property that the partition minimizes the maximum sum of a pair. For example, say we are given the numbers `(1,3,5,9)`. The possible partitions are `((1,3),(5,9))`, `((1,5),(3,9))`, and `((1,9),(3,5))`. The pair sums for these partitions are `(4,14)`, `(6,12)`, and `(10,8)`. Thus the third partition has 10 as its maximum sum, which is the minimum over the three partitions.
