A while ago I wrote about how hash maps work, but something’s been bugging me. How does the hash function do its thing? I know hash functions make variable length data into fixed length data but how do they do that? To be clear I’m interested in the kind of hash you would use for a hash map, you would definitely want a more secure hash to keep your passwords safe.
Thanks to the magic of the internets, it’s really easy to find the function java uses to calculate a String’s hashcode.
/* Returns a hash code for this string. The hash code for a String object is computed as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.) Returns: a hash code value for this object. */ public int hashCode() { int h = hash; if (h == 0) { int off = offset; char val[] = value; int len = count; for (int i = 0; i < len; i++) { h = 31*h + val[off++]; } hash = h; } return h; }
Okay great, that’s totally clear, right? ;)
Yeah, I have no idea what it’s actually doing either. But I can find out!
First of all, where are the values of hash, offset, and count coming from? They must be instance variables because they weren’t passed in as parameters. I poked around in the String code a little more and it turns out hash is defaulted to 0 when it’s declared, offset is set to 0 in the constructor, and count is set to the size of the string when it’s created.
The first thing hashCode actually does is checks if hash is 0. If it’s not, then we know we already computed the hash and we can just return it and go on with our day. Makes sense, why do the same calculation over and over again when we can just do it once and store the result? I think that’s the same reason count is stored separately instead of just calling value.length() when you need it. We know the length will never change because Strings are immutable, so why not save ourselves a lookup?
The next weird thing is how the method is adding a number to a char. Chars are characters, not numbers, aren’t they? Well, yes and no. According to the docs, a char is “a single 16-bit Unicode character. It has a minimum value of '\u0000'
(or 0) and a maximum value of '\uffff'
(or 65,535 inclusive).” That 0 to 65,535 part seems suspiciously like a number :) You can also test that out yourself in the Java REPL. It turns out Java will happily treat a char like an int if you ask it to.
The rest of it is pretty simple, we’re just looping through every character in the string and adding (31 * current hashcode) + current character to the existing hashcode.
Okay, but how does that map a string of any length to a hash code of fixed length? Shouldn’t a longer String always have a larger hash code? Not if your hashcode is an integer! Those just roll over into negative numbers if you add too large of a number to them. And because 2 and -821785444 are both integers they take up the same amount of memory, which means that no matter what size String you start with, the hashcode is always the same size.
Another interesting little detail of how hashmaps actually use those hashcodes is that they rehash your hashes. If everyone used random Strings for keys then they wouldn’t need to, but because keys are usually Strings with some kind of meaning, that means the hashes for those keys won’t be evenly distributed. That is, a hashcode doesn’t have an equal chance of being any number from -231 to 231-1, you’re going to get clumps of hashes around some numbers because you’re more likely to use some Strings than others.
Great, but why does that matter? Performance! The more collisions you have (different Strings that happened to work out to the same hashCode), the more elements you need to look at to find the one you wanted and the worse your performance is. To get around that, java does some bitwise operations on the hashcode to reduce the number of collisions.
Now we all have some idea what actually happens when you use a HashMap :)