Sunday 14 September 2014

Hash Collisions of the Non-Cryptographic Kind

Recently I had a bug which required me to create a hash collision between two strings. Fortunately it wasn't a cryptographically secure hashing algorithm, it was only used in a hash table. The algorithm itself was pretty simple, as shown below:
int hash(const char* c, size_t len) {
 int h = 0;

 while (len > 0) {
   h = h * 31 + *c++;
   len--;
 }

 return h;
}

I had one string, say "abc" which I needed to have the same hash as "xyz". As this code was written in C all string comparisons were performed using the standard C string functions. Therefore from a comparison point of view the strings "abc", "abc\0efgh" etc. are equivalent as the comparison function would terminate at the NUL. However because the hashing algorithm takes the entire string including any NUL characters their hash values are not equal. This makes it possible to create a string with the following properties:

char* s = "abc\0"+SUFFIX;
strcmp(s, "abc") == 0

and

H(s) != H("abc")
H(s) == H("xyz")

To generate the collision you might be tempted to try brute force, well good luck with that. Being one for pointless pop culture references I recalled the Exorcist, and realized "The Power of Maths Compels You...", isn't that what the film's about?

On second thoughts perhaps it's about something else entirely?
Clearly Maths holds the key to calculating the hash collision. I had free reign on the characters I could choose to generate the collision which makes it simpler. The key is realising what the hashing algorithm actually does. If you expand out to the original code you get something like the following, where S is the string and N is the length of the string.

h = S[0]*31^N-1 + S[1]*31^N-2 + ... + S[N-1]

Does that look familiar? No? Well What if I changed the value 31 to 2 giving:

h = S[0] << N-1 + S[1] << N-2 + ... + S[N-1] << 0

Look more familiar? It turns out all the hashing algorithm is doing is generating a base31 number. Therefore finding a hash collision mathematically is going to be pretty simple. All we need to do is calculate the hash of the string with a dummy suffix, calculate the difference between the hash of the string with the suffix and the destination hash and finally generate a replacement suffix with that value in base31. So for completeness here's my code in C++.

#include <stdio.h>
#include <string>
#include <iostream>
using namespace std;

int H(int h, const string& s) {
  for (char c : s) {
    h = h * 31 + c;
  }
  return h;
}

string collide(const string& target_str, const string& base_str) {
  // Initialize suffix with all zeros
  string suffix(8, 0);

  int target_hash = H(0, target_str);
  int base_hash = H(H(0, base_str), suffix);

  unsigned int diff = target_hash - base_hash;

  for(int i = 7; i > 1; --i) {
    suffix[i] = diff % 31;
    diff /= 31;
  }

  suffix[1] = diff;

  return base_str + suffix;
}

int main() {
  string a = "xyz";
  string b = "abc";
  string c = collide(a, b);

  cout << H(0, a) << " " << H(0, b) << " " << H(0, c) << endl;
  cout << b.compare(c) << " " << strcmp(b.c_str(), c.c_str()) << endl;

  return 0;
}

Sometimes it pays just to sit down and think through seemingly tricky problems as they tend to be simpler than you imagine.