Strategies for Handling Terabytes of Data in Java: Alternatives to Massive In-Memory Arrays

Over the last decade, data volumes have grown exponentially, to the point where we now routinely talk about the ‘data lake’. More users, the Internet of Things (IoT), data collection from mobile phones and more have all contributed to this growth.

Applications may now need to process what’s often referred to as Big Data, where volumes may be measured in terabytes or even petabytes. Possibly in another ten years we may view a terabyte as a small amount of data! Should we be rethinking our data processing model? Strategies that worked well for megabytes may not be practical with today’s volumes.

At one time, holding data in memory arrays was seen as a good performance-enhancing technique. However, when we’re working with Big Data, it’s time to start looking at Java large array alternatives.

JVM Performance Risks of Ultra-Large Arrays

Why should we avoid ultra-large arrays?

Here are a few reasons.

  • There is a limit to how many items a Java array can hold. Array indexes are ints, so the largest possible index is 2³¹ − 1. If we try to create an array bigger than this, the JVM throws java.lang.OutOfMemoryError: Requested array size exceeds VM limit. In practice, we can’t store more than around 2 billion items in an array.
  • They don’t scale well. Arrays are created with a fixed size. They may work fine this year, but in a few years’ time, our traffic volumes may have increased, and the system will have to be amended to cope.
  • We may hope to enhance performance by holding more data in memory and reducing the number of I/O operations. However, this can be counter-productive. The larger the heap, the harder the garbage collector has to work to keep it clean. This can mean high CPU usage and long pauses caused by stop-the-world events, which is often more detrimental to performance than reading from storage. It can be mitigated to some extent by choosing the right GC algorithm (ZGC and Shenandoah both work well with massive heaps; example flags follow this list), but it’s always going to be a factor.
  • There is another reason garbage collection may not work well with ultra-large objects such as big arrays. In GC algorithms designed for large heaps, such as G1GC, ZGC and Shenandoah, the heap is split into regions. An object that occupies more than half of a region is known as a humongous object, and these require special handling. In generational GC, humongous objects are allocated directly in the old generation, which means that even short-lived ones may not be cleared until a full GC event. A bigger problem is that a whole region is dedicated to the one object, even if the object doesn’t fill it. This can result in OutOfMemoryError caused by fragmentation, even when the heap is not fully used.
  • High memory usage can cause the operating system to do excessive paging, which slows down the entire system.
  • Cloud computing costs are calculated on resource usage. If you’re using too much memory, it’s going to up your monthly charges. If the GC is using too much CPU, that’s also going to increase costs.
  • If memory is over-used, we increase the danger of OutOfMemoryError crashes, and production downtime is the last thing a Big Data system needs.
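
As a concrete illustration of the GC point above, switching to a low-pause collector on a reasonably recent JDK is a matter of a single command-line flag (Shenandoah is included in most OpenJDK builds). The heap size and application name below are placeholders:

java -Xmx64g -XX:+UseZGC -Xlog:gc MyBigDataApp
java -Xmx64g -XX:+UseShenandoahGC -Xlog:gc MyBigDataApp

Even so, changing the collector only softens the symptoms; the strategies below tackle the root cause.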

Top 5 Java Large Array Alternatives for Terabyte-Scale 

Ok, we’ve agreed that humongous arrays aren’t a good idea. But what should we do instead? Let’s look at five alternatives.

1. Design Patterns for Memory-Efficient Data Processing 

Many of us are still stuck in turn-of-the-century mindsets when we’re designing IT systems. That’s a bit like choosing to use steam engines in 1926! Successful designers of the era chose petrol or electric engines instead.

Let’s compare the resources that were available for high-end servers in 2000 with what is likely to be available now.

                             2000          2025
Capacity: high-end server
  Cores per CPU              1             64
  CPUs per system            8             128
  RAM capacity               4 GB          4 TB
  Per SSD drive              8 GB          100 TB
  Per HDD drive              100 GB        36 TB
Approximate cost (USD)
  Cost per core              300-1,000     45-60
  1 GB RAM                   800-1,000     2-5
  1 GB SSD                   500-1,000     0.05-0.10
  1 GB HDD                   5-10          0.015-0.03

The cost and capacity of these resources have improved dramatically. But at the same time, the volume of data to be processed has increased even more, so we still need to treat resources as finite.

Although all resources have increased in capacity and fallen in cost, the most spectacular performer is the SSD. In 2000, SSDs were very expensive and found only in the very top-end servers; SSD technology is now used even in small devices such as mobile phones. They are around 40 times faster than HDDs for sequential reads, and up to 10,000 times faster for random reads.

Add to this the fact that multi-core processors, as well as multi-processor machines, are now the norm. I/O can take place in the background while processing continues concurrently, so I/O is no longer the bottleneck that it used to be.

Our designs should always bear this in mind.

  • 2000 thinking: Read as much as you can into memory because RAM is much faster, process it, then output results.
  • 2026 Big Data thinking: Read a small chunk, process it, get rid of it from memory. If possible, carry out tasks concurrently.

At the beginning, we need to decide whether the data must be processed randomly, sequentially or via an index. We also need to decide what really needs to stay in memory long-term, and why. This helps us to decide on the right solution, which might be:

  • Streaming: Data is processed sequentially, one record at a time. Fast I/O buffers reduce the need to wait for I/O, which can take place concurrently. This is ideal for situations where random access to the data isn’t needed. If necessary, the data can be filtered or aggregated as it’s read, so that only summarized data is kept in memory (a minimal sketch follows this list).
  • Chunk Processing: The data is broken down into sizes that conveniently fit into memory. This is especially useful for large sorts, since the chunks can be sorted individually, written to storage, and merged at the end. This technique is used for most Big Data sorting, and it can also be used for aggregating data.
  • Caching: Instead of storing an entire data set for fast look-up, it’s often better to keep a cache of only the most frequently used data. Caches should always be bounded in size and have a working eviction policy to stop them growing beyond their limits. Google’s open source Guava library is a simple way of managing a fast, efficient cache.
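
To make the streaming approach concrete, here is a minimal sketch that totals one column of a large CSV file without ever holding the file in memory. The file name orders.csv and the column layout are assumptions for the sake of illustration; Files.lines reads the file lazily, so only a small buffer is resident at any time.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class StreamingAggregationExample {
    public static void main(String[] args) throws IOException {
        // The stream is lazy: lines are read, processed and discarded one at a time.
        try (Stream<String> lines = Files.lines(Path.of("orders.csv"))) {
            double total = lines
                    .skip(1)                                           // skip the header row
                    .map(line -> line.split(","))
                    .mapToDouble(cols -> Double.parseDouble(cols[2]))  // assumed: amount in the third column
                    .sum();
            System.out.println("Total order value: " + total);
        }
    }
}

Only the running total stays in memory, no matter how large the input file grows.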

2. Maximize Performance with Proven Data Technologies

Vast in-memory data stores may be a sign that our application is trying to do something that could be handled much better by existing, purpose-built software. Both commercial and open source products offer tried-and-tested, optimized solutions for standard tasks. These include:

  • Database software: Today’s database software is both fast and powerful. Querying, filtering and aggregating are handled efficiently, and cursors let us scroll through results one row at a time (a JDBC sketch follows this list). We can choose either relational or NoSQL databases to fit our requirements.
  • Key/value stores: Software such as Apache Ignite offers efficient handling of key/value pairs.
  • External Sorting: For very large data sets, using an external application for sorting is usually more efficient than in-memory sorts. Most cloud providers offer facilities for sorting Big Data.
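
As a sketch of the cursor idea, the standard JDBC API lets us set a fetch size so the driver streams rows in batches instead of loading the whole result set. The connection URL, credentials and table below are placeholders; some drivers (PostgreSQL, for example) also require auto-commit to be off before they will stream results.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CursorExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/sales", "user", "password")) {
            conn.setAutoCommit(false);               // some drivers only stream when auto-commit is off
            try (PreparedStatement stmt = conn.prepareStatement(
                    "SELECT amount FROM orders")) {
                stmt.setFetchSize(1000);             // fetch 1,000 rows at a time, not the whole table
                try (ResultSet rs = stmt.executeQuery()) {
                    double total = 0;
                    while (rs.next()) {              // only one batch of rows is in memory at a time
                        total += rs.getDouble("amount");
                    }
                    System.out.println("Total: " + total);
                }
            }
        }
    }
}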

3. Optimizing Object Footprint with Specialized Data Structures 

We may have been told that arrays are more memory-efficient than complex data structures. This is true in theory, since arrays store the data only, whereas complex structures store navigational information as well. But it’s not always true in practice. Let’s take an extreme example. A company has 600 employees, identified by a 6-digit employee number. If, for some reason, we needed to store all these employees in memory, we could choose to store them in an array indexed by their employee number:

Employee[] emp = new Employee[1000000];

public Employee getEmployee(int empNo) {
    return emp[empNo];
}

Obviously, this is highly inefficient, since we’ve reserved space for 1,000,000 employees when we only need to store 600. We’d be much better off using a hash map, with the employee number as the key; even though a hash map has some overhead per entry, it still saves a great deal of space. Although this example borders on the ridiculous, it illustrates the point that the right data structure is often more efficient than an array. The Java Collections framework offers a full range of data structures suited to different tasks, so it’s always worth checking whether there is a better solution than simply using an array.
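
For comparison, a minimal map-based version of the same lookup might look like the following (Employee is the same class assumed in the array example above):

import java.util.HashMap;
import java.util.Map;

public class EmployeeLookup {
    // Space is proportional to the 600 employees actually stored,
    // not to the 1,000,000 possible employee numbers.
    private final Map<Integer, Employee> employees = new HashMap<>();

    public void addEmployee(int empNo, Employee employee) {
        employees.put(empNo, employee);
    }

    public Employee getEmployee(int empNo) {
        return employees.get(empNo);
    }
}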

4. Off-Heap Memory and Direct Buffer Allocation

Many programmers think of Java memory and the heap as almost synonymous, but they’re not. If you’re interested, this video explains JVM memory management very clearly. For processing large chunks of data read from storage, the Java ByteBuffer class has an allocateDirect() method that creates an off-heap buffer.

Is there any advantage to this, given that the buffer still occupies the process’s memory? There can be. It takes load off the garbage collector, since only the reference to the buffer is stored on the heap, and the large block of data is never shifted around during garbage collection. Direct buffers also give much faster I/O, since the data doesn’t have to be copied between native memory and the heap.

The code below shows how to create a 1 GB direct byte buffer:

import java.nio.ByteBuffer;

public class HugeBufferExample {
    public static void main(String[] args) {
        int oneGigabyte = 1024 * 1024 * 1024;
        try {
            ByteBuffer buffer = ByteBuffer.allocateDirect(oneGigabyte);
            System.out.println("Buffer allocated: " + buffer);
        } catch (OutOfMemoryError e) {
            System.out.println("Failed to allocate buffer: " + e.getMessage());
        }
    }
}

Although it solves the garbage collection issues related to large objects, this does have its shortcomings:

  • Since the allocateDirect() method takes an int as its size parameter, a single buffer is limited to just under 2 GB (Integer.MAX_VALUE bytes).
  • Over-use of this solution could result in undue pressure on the operating system’s memory.

Another technique is to use memory mapped files. These behave like memory, and can be accessed randomly, but they’re actually stored in the operating system’s virtual memory and backed by a file. The operating system can page sections of the file in and out of RAM as needed.

Here is a sample of the code needed to create a memory-mapped file:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedExample {
    public static void main(String[] args) throws Exception {
        long fileSize = 1024L * 1024 * 1024 * 1024;   // 1 TB backing file
        long regionSize = 1024L * 1024 * 1024;        // a single mapping cannot exceed Integer.MAX_VALUE bytes
        try (RandomAccessFile file = new RandomAccessFile("large.dat", "rw");
             FileChannel channel = file.getChannel()) {
            file.setLength(fileSize);                 // created as a sparse file on most file systems
            // Map the first 1 GB region; a 1 TB file is covered by a series of such mappings.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, regionSize);
            System.out.println("Mapped the first region of the 1 TB file: " + buffer);
        }
    }
}

5. Scaling Beyond the Monolith with Distributed Big Data Frameworks

The thinking here is to use technology designed for processing large-scale data, rather than storing it in the heap. Cloud services and data management libraries are evolving all the time to provide better options for processing Big Data. Here are a few that are worth looking into.

  • Reactive Streams remove the need to buffer large amounts of intermediate data when stream producers and consumers work at different rates, thanks to built-in backpressure. Examples include Project Reactor and RxJava (a minimal Reactor sketch follows this list).
  • Distributed solutions: Frameworks such as Apache Hadoop and Apache Spark allow us to scale our tasks horizontally, splitting large jobs across several servers.
  • High-volume streaming solutions such as Kafka and Amazon Kinesis allow us to collect, analyze and process Big Data streams efficiently.
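
Here is a minimal sketch of the reactive idea, assuming Project Reactor (reactor-core) is on the classpath. The numeric range stands in for a real source such as a file or message queue, and limitRate() caps how much data the producer is asked for at a time:

import reactor.core.publisher.Flux;

public class BackpressureExample {
    public static void main(String[] args) {
        Flux.range(1, 1_000_000)                     // placeholder for a real high-volume source
            .limitRate(1_000)                        // request data from the producer in batches of 1,000
            .map(i -> i * 2L)                        // per-record processing
            .filter(v -> v % 3 == 0)
            .count()                                 // only the running count is kept, not the records
            .subscribe(n -> System.out.println("Records kept: " + n));
    }
}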

Conclusion

Data volumes are growing every year, faster even than the hardware available to process them. Solutions that worked well for smaller data sets need rethinking when we’re processing Big Data.

Ultra-large arrays can be counterproductive, sometimes killing performance rather than enhancing it. In this article, we’ve looked at several Java large array alternatives to make applications perform and scale better for high volumes.

From rethinking our design to understanding and using Big Data technology, we’ve seen a range of solutions. Let’s make our Big Data solutions fly!
