Will your application need to handle very large volumes of data? If so, you need to take time to plan carefully, or you run the risk of encountering Out of Memory errors when your program goes live.
To successfully process large datasets in Java, there are several factors to take into account. In this article, we’ll look at planning and coding for high volumes, as well as testing and live monitoring considerations.
Types of Out of Memory Errors in Java
In Java, there are 9 different kinds of Out of Memory errors. To guard against all of them, you need to have a good understanding of the JVM memory model.
It’s tempting to only consider the heap when you’re trying to prevent Out of Memory errors. In fact, there are several memory areas you need to keep in mind:
- The Heap: Java’s central object storage area;
- Metaspace/PermGen: Class metadata, method byte code etc.;
- Direct Buffers: Buffering for fast I/O operations;
- Thread Space: Stacks for each running thread;
- JNI: Memory used by the Java Native Interface.
Additionally, you need to take into account the total RAM of the device, to avoid operating system memory issues. And finally, there is a maximum size for arrays. This may vary with the platform or JVM version, but in general, arrays can’t hold more than 2³¹ − 8 items.
Assessing Memory Requirements for Large Datasets in Java
The first question to answer is: do you really need a large volume of data in RAM all at the same time? Why? Is there a way of avoiding this situation? There’s generally a trade-off between the speed gained by holding the data in memory, and the cost of the RAM required to do this. As with all systems design issues, we need to assess the costs and benefits of different solutions. We also need to take into account any other applications that will be running on the device, and decide whether their performance may be degraded.
We need to remember that overloading memory can itself degrade performance, by overworking the garbage collector and triggering operating system paging.
Certain applications do, of course, need to hold high volumes of data in RAM. These include:
- Stock trading platforms, where it’s essential to have the very latest data available instantly;
- Simulators and some gaming applications;
- Robotics.
Most applications, however, can work successfully using a much smaller memory footprint, as the article will discuss in later sections.
It’s important to accurately estimate the amount of RAM that your solution will require. Here are some pointers that will help you:
- Java Primitives: Oracle’s breakdown of space required.
- Objects: It’s difficult to calculate this accurately. Actual sizes can be obtained at the early testing phase by using a tool such as HeapHero. We need to bear in mind:
- Each object has an overhead of around 12 bytes for the object header;
- Objects can create other objects ad infinitum, so this must be taken into account. “Shallow Heap” is the heap space occupied by the object itself, whereas “Retained Heap” is the space occupied by the object and all its children;
- Images, video and audio require a lot of memory; online calculators can help you estimate their sizes.
In extreme cases, where memory requirements are huge, we can consider distributed processing using a tool such as Apache Spark.
Planning
First, ask a few questions:
- What data do you need in memory at each stage of processing?
- Why?
- How long do you need it for?
- Does it need to be in a particular order, or will it be accessed randomly? If randomly, how will it be indexed? What data structure best fits these requirements?
If the data is accessed sequentially, one record at a time, is there any reason to keep all of it in memory, or can it be read from storage as needed? Bear in mind that SSD drives are much faster than HDDs, although still significantly slower than RAM. Buffering can dramatically reduce the latency of storage transfers, and the size of the buffers can be controlled by the program. Direct (off-heap) buffers can improve I/O performance considerably.
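As a minimal sketch of reading a file through a direct buffer with NIO (the file contents and the 8 KB buffer size are illustrative assumptions, and the example creates its own temporary file so it is self-contained):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectBufferRead {
    public static void main(String[] args) throws IOException {
        // Create a small sample file so the example is self-contained.
        Path path = Files.createTempFile("sample", ".txt");
        Files.writeString(path, "hello large datasets");

        // A direct buffer lives outside the heap, so the OS can transfer
        // data into it without an extra copy through a Java byte[].
        ByteBuffer buf = ByteBuffer.allocateDirect(8192);
        StringBuilder out = new StringBuilder();
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            while (ch.read(buf) != -1) {
                buf.flip(); // switch from writing into the buffer to reading from it
                out.append(StandardCharsets.UTF_8.decode(buf));
                buf.clear();
            }
        }
        System.out.println(out); // prints: hello large datasets
        Files.delete(path);
    }
}
```

In a real application the buffer would typically be allocated once and reused across many reads, since direct buffers are relatively expensive to create.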
If accessed randomly, well-tuned databases generally offer good performance.
Caching a limited amount of frequently-used data can give a good trade-off between memory usage and performance.
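One simple way to cap such a cache is a `LinkedHashMap` in access order, which evicts the least-recently-used entry once a size limit is exceeded. This is a sketch only; the class name and the capacity of 2 are assumptions for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal fixed-size LRU cache: LinkedHashMap kept in access order
// drops the least-recently-used entry once the cap is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = order by access, not insertion
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so "b" becomes least recently used
        cache.put("c", "3"); // evicts "b"
        System.out.println(cache.keySet()); // prints: [a, c]
    }
}
```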
In some cases, particularly when processing media, lightweight minimal compression algorithms can reduce memory usage without a huge reduction in performance. There are several suitable Java APIs available for this.
String deduplication can be a big memory-saver in some cases.
For critical, high-performance applications, it’s worth prototyping and benchmarking several different solutions before committing to a design. This will also help with planning the RAM requirements for the final solution.
Coding for Processing Large Datasets in Java
Two aspects that we should consider when coding are:
- The garbage collector should be able to effectively identify and remove any objects or classes that are no longer needed;
- Memory should be used efficiently.
Garbage Collection (GC) Considerations
The GC removes any object that no longer has a live reference pointing to it. References disappear when a variable goes out of scope (for example, a method’s local variables go out of scope when the method returns), when a variable is set to null, or when a stream object is closed. For the GC to do its job, programmers must ensure that variables are defined within the narrowest appropriate scope, so objects won’t be retained when they’re no longer needed. Further tips:
- Always close streams when they’re no longer needed;
- Re-use objects whenever possible;
- Don’t define large data structures as instance variables;
- Be wary of large static variables;
- Using the System.gc() command to manually invoke the GC is not recommended. This forces a full garbage collection, often causing latency.
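The first two tips can be combined with try-with-resources, which guarantees a stream is closed even if an exception is thrown. A small self-contained sketch (the temporary file and its three rows are assumptions for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ScopedResources {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("data", ".csv");
        Files.write(path, List.of("row1", "row2", "row3"));

        long count = 0;
        // try-with-resources closes the reader even if an exception is thrown,
        // so the stream never lingers waiting for the GC to reclaim it.
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        // 'reader' is out of scope here; the buffer it held is now collectable.
        System.out.println(count); // prints: 3
        Files.delete(path);
    }
}
```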
Use Memory Efficiently
Let’s look at a few strategies for using memory efficiently.
- Java primitives use much less space than their boxed object equivalents. Remember, there is an overhead of around 12 bytes on each object.
- Use data structures that fit and optimize your task.
- Use SoftReference for caching. If memory runs short, objects with soft references will be garbage collected, ensuring the cache doesn’t grow too large.
- If possible, load and process data from disk in batches:
- If the data is in a database, use cursors to scroll through the data as needed. Also, tune the selection to extract only the columns and rows you actually need.
- Use streams to process data sequentially, rather than loading the entire set at once. Streams can be buffered for faster access, and there is a facility for processing streams in parallel if you have sufficient CPU cores.
- For large datasets, use external rather than internal sorts.
- The FastUtil library offers a way of storing collections more efficiently, and processing them faster.
- Consider paging data to disk if it’s not needed immediately.
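The SoftReference technique above can be sketched as a small cache wrapper. This is an illustrative assumption of how such a cache might look; the class name and the loader function are hypothetical:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a soft-reference cache: entries may be reclaimed by the GC
// under memory pressure, so every lookup must be prepared to reload.
public class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();
    private final Function<K, V> loader; // hypothetical loader supplied by the caller

    public SoftCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get(); // null if never cached or already collected
        if (value == null) {
            value = loader.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    public static void main(String[] args) {
        SoftCache<Integer, String> cache = new SoftCache<>(k -> "record-" + k);
        System.out.println(cache.get(42)); // prints: record-42
    }
}
```

Because the GC may clear soft references at any time, this pattern suits caches of recomputable or reloadable data, never data that exists only in memory.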
Testing and Tuning
When testing, it’s worth gathering statistics about memory usage, and getting accurate sizes for your data structures. The JDK offers several useful tools to help with this. The HeapHero utility very quickly gathers accurate and comprehensive data regarding heap usage.
Load testing is extremely important when you’re working with large datasets. Gathering statistics of total memory usage, heap usage and GC performance is critical for making informed tuning decisions. Along with the JDK tools, HeapHero is again invaluable for this. GCeasy is an excellent utility for gathering key information about how well GC is working. It also highlights any issues, and makes performance-tuning suggestions.
Once you have this data, you’ll be able to accurately predict memory requirements for the system when it goes live. You will also be able to make informed tuning decisions, and benchmark the effect of different tuning parameters.
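Alongside the external tools, the application itself can log coarse heap figures via the standard `Runtime` API, which is often useful during load tests. A minimal sketch:

```java
public class MemoryStats {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxMb   = rt.maxMemory()   / (1024 * 1024); // ceiling set by -Xmx
        long totalMb = rt.totalMemory() / (1024 * 1024); // heap currently reserved from the OS
        long usedMb  = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.out.printf("max=%dMB total=%dMB used=%dMB%n", maxMb, totalMb, usedMb);
    }
}
```

Logging these figures periodically during a load test gives a rough usage curve to compare against the detailed profiles from HeapHero and GCeasy.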
You can tune various parameters using JVM runtime switches. Some of the areas to look at are:
- Setting the initial and maximum heap sizes based on load testing results;
- Tuning GC, and using the best GC algorithm for your task. The ZGC and Shenandoah algorithms generally work better for very large heap sizes;
- Setting the sizes of the metaspace and the direct buffer space;
- Enabling String deduplication.
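Put together, an invocation covering the areas above might look like the following. The flag names are standard HotSpot options, but every value shown is an illustrative assumption (as is the jar name) and should be derived from your own load-test results:

```shell
# Illustrative values only; derive the real numbers from your own load tests.
# -Xms/-Xmx:                   initial and maximum heap size (equal values avoid resizing)
# -XX:+UseG1GC:                the G1 collector, which supports string deduplication
# -XX:+UseStringDeduplication: deduplicate identical character arrays behind Strings
# -XX:MaxMetaspaceSize:        cap the class-metadata space
# -XX:MaxDirectMemorySize:     cap the direct (off-heap) buffer space
# -Xlog:gc:                    log GC activity for later analysis with a tool like GCeasy
java -Xms4g -Xmx4g \
     -XX:+UseG1GC -XX:+UseStringDeduplication \
     -XX:MaxMetaspaceSize=256m \
     -XX:MaxDirectMemorySize=1g \
     -Xlog:gc \
     -jar myapp.jar
```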
Monitoring the Live System
Since application loads change with time, it’s a good idea to monitor production systems regularly to pre-empt Out of Memory errors and performance issues.
These tools give comprehensive stats and warnings of impending problems:
- HeapHero for monitoring heap usage;
- GCeasy for ensuring GC is working efficiently;
- yCrash for constant background monitoring, providing incident reports, warnings of developing issues and 360° diagnostic information in the event of a crash.
Conclusion
If you need to process large datasets in Java, it’s critical to keep memory considerations in mind at all stages of the project lifecycle. At the planning stage, we need to consider the memory requirements of proposed solutions, and trade off performance against memory costs. When coding, always make sure objects can be garbage collected when no longer in use, and optimize memory usage as much as possible.
At the testing stage, especially during load testing, we can gather important information to accurately predict the needs of the system in production. These can be used to tune the JVM, and ensure sufficient memory resources will be available on the live machine.
Finally, systems that deal with huge volumes of data should be regularly monitored in production, so any developing issues can be dealt with before they crash the system.
