What I learn from the souce code of Java HashMap
Hash table related question is one of the most common interview questions for software engineer. You may be asked the simple question like “what is the time complexity of inserting a key-value pair”, or a very complex qeustion like “how do we design a hash table with these requirements …”. So, understanding the details of the hash table is very necessary. HashMap is the most commonly-used implementation of hash table in Java. I read the souce code of it. Here is what I learn from it.
Basic Idea
We all know that the hash table is an array under the hood. When we insert a pair of key and value, we hash the key to a integer, which is the index of the array. And we place the value in the bucket. Looks very simple, but there are several problems we have to deal with. I pick up three key problems and we will take a look at how Java’s HahsMap solves those problems.
1. How to hash?
You may guess that we just need to use key.hashCode(). However, in the source code, the hash code is hashed again by a hash method:
int hash = hash(key.hashCode());
Let’s take a look at what the hash method looks like:
static int More ...hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
Don’t know what is going on? Well, basically, this method helps us avoid collisions. Imagine the hash code of the key is not very good (evenly distributed), this method gives us the second chance to redo the hash process. That’s it.
After we get the hash value, we need to find the bucket’s index:
static int indexFor(int h, int length){
return h & (length - 1);
}
Because length always is power-of-two, length - 1 is always a number with all 1s. So the return value is always less than length.
2. How to deal with collision?
What if collision happens? If you learn hash table in class, you probably remember there are basically two methods: seprate chaining and open address. HashMap is using seprate chaining. Every pair of key-value is store in an Entry class:
static class More ...Entry<K,V> implements Map.Entry<K,V> {
final K key;
V value;
Entry<K,V> next;
final int hash;
}
As we can see, it has key and value. It also a pointer pointing to the next Entry, just like the linked list.
When inserting a new Entry to a bucket and there is already an entry there, the new one will be in the bucket. Its Next is pointing to the old one.
3. How to resize?
There are several key variables: loadFactor, size, threshold. loadFactor is a float number between 0 and 1. size is the number Entries that have been placed in the array. threshold = size * loadFactor.
If size >= threhold (this typically happnes when inserting new Entry), we want to enlarge the array to avoid collision. When resizing, Java initializes a new array which has doubles the size of the old one. Then iterate every entry in the old array, place them in the right slot of the new array.
Interestingly, I don’t find resize is used when removing Entries as I though shrinking the array would save more memory. Probably resizing in this case is not necessary since it’s time-consuming.