Many of us know that allocating more memory than we need may negatively affect the performance of our application. Thus, creating the Lists using a constructor that takes a capacity may make a significant difference.
However, this optimization step might not be so straightforward while using Maps. In this article, we’ll learn how to identify the problem with overallocation and under allocation in a HashMap and, more importantly, how to resolve it.
A Faulty Example
Let’s consider this code snippet where we create a HashMap and populate it with several entries:
public static void main(String[] args) {
    for (int i = 0; i < 1_000_000_000; i++) {
        final HashMap<String, Integer> workDays = new HashMap<>();
        workDays.put(new String("Monday"), 1);
        workDays.put(new String("Tuesday"), 2);
        workDays.put(new String("Wednesday"), 3);
        workDays.put(new String("Thursday"), 4);
        workDays.put(new String("Friday"), 5);
        MAP_ACCUMULATOR.add(workDays);
    }
}
The issue here is that the default initial capacity of a HashMap is sixteen, while eight would be enough for the five entries above. Luckily, this problem becomes quite visible if we analyze the heap dump using HeapHero:

The Inefficient Collections section lists collections that occupy more space than they need. Here, we can see that our HashMaps contain far fewer elements than they can accommodate.
However, let’s consider a bit more subtle example:
public static void main(String[] args) {
    for (int i = 0; i < 1_000_000_000; i++) {
        final HashMap<String, Integer> planets = new HashMap<>(8);
        planets.put(new String("Mercury"), 4879);
        planets.put(new String("Venus"), 12104);
        planets.put(new String("Earth"), 12742);
        planets.put(new String("Mars"), 6779);
        planets.put(new String("Jupiter"), 139820);
        planets.put(new String("Saturn"), 116460);
        planets.put(new String("Uranus"), 50724);
        planets.put(new String("Neptune"), 49244);
        MAP_ACCUMULATOR.add(planets);
    }
}
We have eight planets and a map with the capacity set to eight. Some of you have already noticed the problem. Still, let's analyze the heap dump for this tiny application, again using the HeapHero analyzer:

The Inefficient Collections section doesn't show any problems this time. We must go through several steps and calculations to identify the issue. To do so, we need to compare the retained heap occupied by our maps with the retained heap occupied by the internal arrays that hold the nodes. The difference is the overhead, showing how many empty slots we have:

As we can see, the overall retained heap of the HashMaps is almost 2GB, while the total retained heap of the nodes is only 1.62GB. That leaves nearly 400MB of overhead, or almost 20% of non-utilized space.
According to the heap dump, the actual size of the Map is larger than we expected, and we wasted some space.
However, there's not much we can do about this particular problem: the capacity of a map is always a power of two, and we also have to account for the load factor (explained later in this article). That's why we see the same space overhead even when creating a Map with the correct capacity.
At the same time, under-allocated maps create a different problem. Let's analyze the code using JMH, with two simple benchmarks that create Maps in a loop:
@Measurement(iterations = 1, time = 2, timeUnit = TimeUnit.MINUTES)
@Warmup(iterations = 1, time = 10)
@Fork(1)
public class MapCapacityOverhead {

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void mapWithUnderestimatedCapacity(Blackhole blackhole) {
        final HashMap<String, Integer> map = new HashMap<>(8);
        map.put(new String("Mercury"), 4879);
        map.put(new String("Venus"), 12104);
        map.put(new String("Earth"), 12742);
        map.put(new String("Mars"), 6779);
        map.put(new String("Jupiter"), 139820);
        map.put(new String("Saturn"), 116460);
        map.put(new String("Uranus"), 50724);
        map.put(new String("Neptune"), 49244);
        blackhole.consume(map);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void mapWithCorrectCapacity(Blackhole blackhole) {
        final HashMap<String, Integer> map = HashMap.newHashMap(8);
        map.put(new String("Mercury"), 4879);
        map.put(new String("Venus"), 12104);
        map.put(new String("Earth"), 12742);
        map.put(new String("Mars"), 6779);
        map.put(new String("Jupiter"), 139820);
        map.put(new String("Saturn"), 116460);
        map.put(new String("Uranus"), 50724);
        map.put(new String("Neptune"), 49244);
        blackhole.consume(map);
    }
}
The results show that the two benchmarks have very different throughputs:
| Benchmark | Mode | Score | Units |
| --- | --- | --- | --- |
| mapWithCorrectCapacity | thrpt | 7256575.859 | ops/s |
| mapWithUnderestimatedCapacity | thrpt | 5581449.247 | ops/s |
Using an underestimated initial capacity makes the code around 23% slower than allocating the correct number of buckets when initializing the Map.
Capacity vs. Mappings
The main thing to understand is the difference between a List’s and a Map’s capacities. The capacity of a list is straightforward: the number of elements we plan to store in the List.
However, in a HashMap, we need to account for another important parameter: the load factor. By default, a HashMap uses a load factor of 0.75, which means the Map never reaches its full capacity: it resizes as soon as it becomes more than 75% full.
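The resize rule can be sketched in a few lines (a simplification of the JDK's internal logic; the class and method names here are illustrative, not JDK API):

```java
// A minimal sketch of the resize rule: a HashMap grows once its size
// exceeds threshold = capacity * loadFactor.
public class ResizeThreshold {

    // Number of entries a HashMap with the given capacity can hold
    // before it resizes (mirrors the JDK's threshold rule).
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        // A default HashMap has 16 buckets and a 0.75 load factor,
        // so it resizes while inserting the 13th entry.
        System.out.println(threshold(16, 0.75f)); // 12
    }
}
```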
We saw this in our previous example, and the question is how much capacity we should allocate to store a particular number of elements. Doing it correctly will save us space and time – as we won’t waste processing power to rehash the entries in the map.
Calculating the Capacity
Now that we know the capacity isn't equal to the number of mappings, we'll assume the default load factor throughout. Technically, we could set the load factor to 1.0, but that would create another problem: hash collisions.
There are several ways to calculate the correct capacity for a given number of mappings. Let’s review some of them.
1. Naoto Sato’s Formula
This formula is relatively easy to use and to understand:
int capacity = (int) (numMappings / loadFactor) + 1;
However, it might allocate more memory than needed for certain values. With a load factor of 0.75, this method allocates extra space whenever the number of mappings is divisible by three:
int capacity = (int) (6 / 0.75) + 1; // = 9
The resulting capacity is technically correct, but eight buckets would be enough to accommodate six elements without resizing. Since the size of the internal table in a HashMap is always a power of two, a Map created with this formula for six elements ends up with sixteen buckets preallocated instead of eight.
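To make the overshoot concrete, here is the formula wrapped in a tiny runnable sketch (the class name is mine, not from the article):

```java
public class NaotoFormula {

    // Naoto Sato's formula: divide by the load factor and add one.
    static int capacity(int numMappings, double loadFactor) {
        return (int) (numMappings / loadFactor) + 1;
    }

    public static void main(String[] args) {
        // For six mappings the division is exact (6 / 0.75 = 8), so the
        // extra +1 pushes the result to 9, which a HashMap rounds up to
        // 16 buckets even though 8 would have been enough.
        System.out.println(capacity(6, 0.75)); // 9
        // For seven mappings the result is fine: 7 / 0.75 = 9.33 -> 9 + 1 = 10.
        System.out.println(capacity(7, 0.75)); // 10
    }
}
```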
2. Google Guava’s Formula
This formula is similar to the previous one but uses float for the entire calculation:
int capacity = (int) ((float) numMappings / 0.75f + 1.0f);
However, while the previous formula has a problem with overestimating the capacity, this one has the opposite problem: due to float arithmetic and rounding, the resulting capacity can be too small for certain inputs, forcing a rehash.
3. Decimal Arithmetic
These two formulas avoid the rounding issues of imprecise float calculations. The first one works with ints directly:
int capacity = (numMappings * 4 + 2) / 3;
The second formula uses long:
int capacity = (int) ((numMappings * 4L + 2L) / 3L);
Unfortunately, both are prone to integer overflow, which can produce negative numbers or values far from the optimum.
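The overflow is easy to trigger with large inputs; here's a small sketch of both variants (the class name is illustrative):

```java
public class OverflowDemo {

    // int-only variant: numMappings * 4 overflows for large inputs.
    static int intCapacity(int numMappings) {
        return (numMappings * 4 + 2) / 3;
    }

    // long variant: the multiplication and division are safe,
    // but the final cast to int can still wrap around.
    static int longCapacity(int numMappings) {
        return (int) ((numMappings * 4L + 2L) / 3L);
    }

    public static void main(String[] args) {
        // For small inputs both are exact: (6 * 4 + 2) / 3 = 8.
        System.out.println(intCapacity(6)); // 8
        // For two billion mappings both produce negative "capacities".
        System.out.println(intCapacity(2_000_000_000));  // negative
        System.out.println(longCapacity(2_000_000_000)); // negative
    }
}
```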
4. Ceiling Formula
Another simple formula to calculate the capacity uses Math.ceil:
int capacity = (int) Math.ceil(numMappings / 0.75f);
The result might be underestimated for specific large numbers due to floating-point precision.
5. Java 19 API
Java 19 introduced a new static factory, HashMap.newHashMap(int), which takes the number of mappings and calculates the capacity transparently. This new method is a straightforward way to create a HashMap with the desired capacity without over- or underestimation.
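For codebases stuck on older Java versions, a close equivalent can be written by hand. This is my reading of what the factory computes internally (a sketch, not the JDK source; the helper name is mine), using double arithmetic to sidestep the float rounding issues above:

```java
import java.util.HashMap;
import java.util.Map;

public class NewHashMapEquivalent {

    // Pre-Java-19 stand-in: a capacity large enough for numMappings
    // entries at the default 0.75 load factor.
    static <K, V> Map<K, V> newHashMap(int numMappings) {
        return new HashMap<>((int) Math.ceil(numMappings / 0.75));
    }

    public static void main(String[] args) {
        // ceil(6 / 0.75) = 8, so six mappings request exactly 8 buckets,
        // avoiding the +1 overshoot of Naoto's formula.
        Map<String, Integer> workDays = newHashMap(6);
        workDays.put("Monday", 1);
        System.out.println((int) Math.ceil(6 / 0.75)); // 8
    }
}
```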
Number of Buckets
Although we can pass the required capacity while creating a HashMap, that doesn't mean the map will contain exactly that number of buckets. For performance reasons, the number of buckets is rounded up to the nearest power of two greater than or equal to the requested capacity.
For example, if we’re passing eight as capacity:
final int initialCapacity = 8;
final HashMap<String, String> map = new HashMap<>(initialCapacity);
map.put("Hello", "World");

final int actualCapacity = getCapacity(map);
System.out.println("The initial capacity is %d, and the actual one is %d"
    .formatted(initialCapacity, actualCapacity));
The actual capacity would be eight:
The initial capacity is 8, and the actual one is 8
At the same time, if we request a little bit more:
final int initialCapacity = 9;
final HashMap<String, String> map = new HashMap<>(initialCapacity);
map.put("Hello", "World");

final int actualCapacity = getCapacity(map);
System.out.println("The initial capacity is %d, and the actual one is %d"
    .formatted(initialCapacity, actualCapacity));
The capacity would be the closest larger number that is a power of two, which is sixteen:
The initial capacity is 9, and the actual one is 16
Note that getCapacity is a custom helper method that reads the capacity using reflection.
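The power-of-two rounding itself can be reproduced without reflection. The following is a sketch modeled on the JDK's internal tableSizeFor logic (a simplified reimplementation for illustration, not the HashMap API):

```java
public class TableSize {

    // Rounds the requested capacity up to the next power of two,
    // the way a HashMap sizes its internal table.
    static int tableSizeFor(int cap) {
        int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
        return (n < 0) ? 1 : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(8)); // 8
        System.out.println(tableSizeFor(9)); // 16
    }
}
```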
Conclusion
To allocate the correct number of buckets, the best and most readable approach is to use the Java 19 API to create a HashMap. However, sometimes it's impossible to bump the Java version due to restrictions or historical reasons.
The next best solution is Naoto's formula, the only method presented above that is safe from overflow and rounding bugs. It's not perfectly optimal, but it never forces rehashing, and it's easy to remember and understand.
Overall, each application has its specific problems and possible optimizations. The best way to account for them is to use diagnostics tools, such as yCrash, to check its memory usage and inefficient collections, either in the dedicated sections or by analyzing retained heaps.
