int hash(const char* c, size_t len) { int h = 0; while (len > 0) { h = h * 31 + *c++; len--; } return h; }
I had one string, say "abc" which I needed to have the same hash as "xyz". As this code was written in C all string comparisons were performed using the standard C string functions. Therefore from a comparison point of view the strings "abc", "abc\0efgh" etc. are equivalent as the comparison function would terminate at the NUL. However because the hashing algorithm takes the entire string including any NUL characters their hash values are not equal. This makes it possible to create a string with the following properties:
char* s = "abc\0"+SUFFIX; strcmp(s, "abc") == 0 and H(s) != H("abc") H(s) == H("xyz")To generate the collision you might be tempted to try brute force, well good luck with that. Being one for pointless pop culture references I recalled the Exorcist, and realized "The Power of Maths Compels You...", isn't that what the film's about?
On second thoughts perhaps it's about something else entirely? |
h = S[0]*31^N-1 + S[1]*31^N-2 + ... + S[N-1]
Does that look familiar? No? Well What if I changed the value 31 to 2 giving:
h = S[0] << N-1 + S[1] << N-2 + ... + S[N-1] << 0
Look more familiar? It turns out all the hashing algorithm is doing is generating a base31 number. Therefore finding a hash collision mathematically is going to be pretty simple. All we need to do is calculate the hash of the string with a dummy suffix, calculate the difference between the hash of the string with the suffix and the destination hash and finally generate a replacement suffix with that value in base31. So for completeness here's my code in C++.
#include <stdio.h> #include <string> #include <iostream> using namespace std; int H(int h, const string& s) { for (char c : s) { h = h * 31 + c; } return h; } string collide(const string& target_str, const string& base_str) { // Initialize suffix with all zeros string suffix(8, 0); int target_hash = H(0, target_str); int base_hash = H(H(0, base_str), suffix); unsigned int diff = target_hash - base_hash; for(int i = 7; i > 1; --i) { suffix[i] = diff % 31; diff /= 31; } suffix[1] = diff; return base_str + suffix; } int main() { string a = "xyz"; string b = "abc"; string c = collide(a, b); cout << H(0, a) << " " << H(0, b) << " " << H(0, c) << endl; cout << b.compare(c) << " " << strcmp(b.c_str(), c.c_str()) << endl; return 0; }
Sometimes it pays just to sit down and think through seemingly tricky problems as they tend to be simpler than you imagine.