Streaming Data and Large Files: Strategies to Prevent Heap Issues

Today’s applications deal daily with volumes of data that nobody dreamed of a few years ago. IoT devices, video and audio recorders, websites, smartphones and more are constantly feeding data into vast information lakes.

Communications systems, distributed processing, big data analytics and live feeds are just a few of the applications whose processing requirements are likely to be measured in terabytes.

Although memory capacities are huge compared to ten years ago, they are still finite. Also, in Java, inefficient use of memory can result in overloading the garbage collector, which causes serious performance issues.

It’s therefore vital when dealing with large files or data streams to plan the application carefully, looking at what data is actually needed in memory at any given moment. It’s also important to choose the most efficient data handling techniques to suit the particular task.

In this article, we’ll look at some of the common methods of reading data, and how each compares in terms of speed and memory usage. We’ll also mention some of the more advanced data handling techniques, and provide links to more information.

Dealing With Large Data Volumes: The Trade-offs

In designing any data processing solution, there are trade-offs to be taken into account. When dealing with large data, the factors we need to look at are:

  • Processing Speed;
  • Memory Usage;
  • Simplicity of design.

In general, holding more data in memory at one time can improve performance: large buffers, or caching a large number of records, reduce the amount of I/O needed. This comes at the cost of greater memory usage, and the benefit can be lost if the larger heap forces the garbage collector to work too hard. Using the simplest APIs can make the system more stable and easier to maintain.

We’ll bear these trade-offs in mind when looking at different file and stream processing methods.

Sample Program: Introduction

We’ll be using a sample program to benchmark different types of solutions, and compare results.

First let’s look at the program skeleton to see how it fits together. In subsequent sections, we’ll look at the individual methods, each of which uses a different file processing technique. Finally, we’ll compare the solutions in terms of processing time and memory usage.

Each of the solutions reads a text file approximately 500 MB in size, and writes it out to a second text file. At the end of each step, the program takes a heap dump for analysis, and pauses to allow us to use the jcmd utility to report on native memory usage.

import java.lang.management.ManagementFactory;
import java.util.Date;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.channels.FileChannel;
import java.nio.ByteBuffer;
import java.io.IOException;
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.FileWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;


public class FileSpeedDemo {
    private Date timeStarted=new Date();
    public static void main(String[] args) {
      FileSpeedDemo demo = new FileSpeedDemo();
    }

    public FileSpeedDemo() {
      readFullFile();
      readLineByLine();
      readWithBuffering();
      readWithDirectBuffers();
      copyFile();   
    }
/* 
   ===========================================================
   Individual methods will be inserted here: see next sections
   ===========================================================
*/

// Take stats after each step
// ==========================
public void statistics(String dumpName) {
// Calculate time taken
   long duration = new Date().getTime()-timeStarted.getTime();
   System.out.println("Time taken: "+duration + " ms");
        
// Take heap dump
   boolean liveObjectsOnly = false; // Set to true to dump only live objects
   try {
            MBeanServer mbeanServer = ManagementFactory.getPlatformMBeanServer();
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    mbeanServer,
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            String dumpFile=dumpName+".hprof";
            Files.deleteIfExists(Path.of(dumpFile));
            bean.dumpHeap(dumpFile, liveObjectsOnly);
            System.out.println("Heap dump successfully generated at: " + dumpName);

        } catch (Exception e) {
            System.out.println("Failed to generate heap dump: " + e.getMessage());          
        }
// Pause to allow native memory tracking
    System.out.println
       ("Track native memory using 'jcmd <pid> VM.native_memory summary' then press enter to continue");
    new java.util.Scanner(System.in).nextLine();
    }
}

To track native memory usage, we’ll need to run the program using the following command line switch:

java -XX:NativeMemoryTracking=detail FileSpeedDemo

At the end of each step, when the program pauses, we can obtain detailed information of both native and heap memory usage using the following command:

jcmd <PID> VM.native_memory summary|more

where <PID> is the process ID of the running Java program.

This gives a breakdown of memory usage. Part of its typical output looks like this:

Native Memory Tracking:

(Omitting categories weighting less than 1KB)

Total: reserved=3527921KB, committed=265673KB
-                 Java Heap (reserved=2066432KB, committed=189440KB)
                            (mmap: reserved=2066432KB, committed=189440KB)

-                     Class (reserved=1048749KB, committed=557KB)
                            (classes #1657)

Comparing Commonly Used I/O Techniques

In this section, we’ll use the sample program to illustrate five different file reading techniques, and see how they perform.

1. Reading The Entire File With One Instruction

This is often the simplest way of reading data from a file. We would almost never use it when dealing with large volumes. It is useful, however, for small configuration files that need to be retained in memory and referred to as needed.

This is illustrated by the following method in the sample program:

public void readFullFile() {
// Read the full file, then write to a new file
      timeStarted=new Date();
      List<String> list = new ArrayList();
      try {
          list = Files.readAllLines(Path.of("textfile.txt"));
          Files.write(Path.of("copyfile.txt"),list);
          }
      catch(IOException e) {
          System.out.println("Error in readFullFile method");
          System.out.println(e.toString());
          }
 
// Create statistics      
     statistics("FSD1_dump");

// Release memory
      list=null;

// Ensure garbage is collected before next phase
      System.gc();
      }

The Java class java.nio.file.Files includes a method named readAllLines, which copies the entire contents of the file into a List object.

This has the advantage of being very simple to code, and it can be moderately fast, but it is extremely memory-hungry.

When run, the results were as follows:

Time Taken: 18.261 s
Total Memory Usage: 1,189,113 KB
Heap Usage: 1,077,248 KB

Analyzing the dump with the HeapHero utility shows that the resulting list is enormous:

Fig: HeapHero Shows Very Large ArrayList

2. Reading Line by Line 

Reading and processing the file one line at a time gives huge memory gains. It may be slower, and only one line can be accessed at a time: we can’t refer back to previous lines.

The simplest way of doing this is to use the class java.util.Scanner, which can be used to read from a file, from the network or from the keyboard.

The sample method looks like this:

// ==========================
// Process one line at a time
// ==========================
public void readLineByLine(){
      timeStarted=new Date();
      try {
          Scanner scanner = new Scanner(Path.of("textfile.txt"));
          PrintWriter pw = new PrintWriter(new FileWriter("copyfile.txt"));
// Read and write one line at a time
          while (scanner.hasNextLine())
             pw.println(scanner.nextLine());
// Create statistics      
          statistics("FSD2_dump");
// Release memory
          scanner.close();
          pw.close();
          scanner=null;
          pw=null;
          }
      catch(IOException e) {
          System.out.println("Error copying line by line" + e.toString());
          }

// Ensure garbage is collected before next phase
      System.gc();
      }

The results were as follows:

Time Taken: 75.61 s
Total Memory Usage: 182,513 KB
Heap Usage: 106,496 KB

As we see, the memory savings are huge, but the time taken is much longer. This method is too slow for dealing with very large volumes.

3. Using Buffering

Buffering, where chunks of the file are read from disk into a buffer area and then processed line by line, gives considerable performance gains. This is because data is fetched from disk or the network in large chunks, so far fewer system calls and device reads are needed than when fetching one line at a time.

We can trade off memory vs performance by adjusting the size of the buffers. Large buffers are faster but use more memory.

The sample method looks like this:

public void readWithBuffering() {
      timeStarted=new Date();
      try {
          BufferedReader reader = Files.newBufferedReader(Path.of("textfile.txt"));
          PrintWriter pw = new PrintWriter(new FileWriter("copyfile.txt"));
          String thisLine;
// Loop through lines until readLine returns null at end of file
          while ((thisLine = reader.readLine()) != null) {
              pw.println(thisLine);
              }
// Create statistics      
      statistics("FSD3_dump");
// Release memory
          reader.close();
          pw.close();
          reader=null;
          pw=null;
          }
      catch(IOException e) {
          System.out.println("Error copying with buffer" + e.toString());
          }
         
// Ensure garbage is collected before next phase
      System.gc();
      }

The results were as follows:

Time Taken: 17.264 s
Total Memory Usage: 145,404 KB
Heap Usage: 70,656 KB

This method is adequate for most applications, but can be improved on for very large volumes, as we’ll see.
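
To illustrate the buffer-size trade-off mentioned above, here is a minimal sketch (separate from the benchmark program) that passes an explicit buffer size to BufferedReader; the 1 MB figure and the file name are assumptions for illustration, to be tuned against your own measurements:

// Sketch only: a BufferedReader with an explicit buffer size (the JDK default is 8 KB).
// Requires: import java.io.BufferedReader; import java.io.FileReader; import java.io.IOException;
public void readWithLargeBuffer() {
      int bufferSize = 1024 * 1024;   // 1 MB - illustrative value only
      try (BufferedReader reader = new BufferedReader(new FileReader("textfile.txt"), bufferSize)) {
          String line;
          while ((line = reader.readLine()) != null) {
              // process each line here
          }
      } catch (IOException e) {
          System.out.println("Error reading with large buffer: " + e);
      }
}

Larger buffers mean fewer underlying reads, at the cost of a bigger heap footprint per open reader.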

4. File Channels and Direct Buffers

File channels are a flexible and efficient way of dealing with I/O, and have the advantage that they can be used with byte buffers. Byte buffers may be created as direct buffers, which are allocated in native memory outside the Java heap. They save considerable time, since the operating system can perform I/O directly on that memory, rather than first copying data into and out of the Java heap.

By using file channels, we can read, manipulate and write data in the buffer. This method is efficient regarding both performance and memory usage.

The sample method looks like this:

public void readWithDirectBuffers(){
      timeStarted=new Date();
      try {  
          FileChannel inChannel=new FileInputStream("textfile.txt").getChannel();
          FileChannel outChannel=new FileOutputStream("copyfile.txt").getChannel();
          ByteBuffer bb = ByteBuffer.allocateDirect(8192);
          while(inChannel.read(bb) != -1) {
               bb.flip();
               outChannel.write(bb);
               bb.clear();    
          }

// Create statistics      
         statistics("FSD4_dump");
// Release memory
      inChannel.close();
      outChannel.close();
      inChannel=null;
      outChannel=null;
      bb=null;
      }
      catch(IOException e) {
          System.out.println("Error copying with direct buffer" + e.toString());
          }
// Ensure garbage is collected before next phase
      System.gc();
      }

The results were as follows:

Time Taken: 13.207 s
Total Memory Usage: 82,769 KB
Heap Usage: 10,240 KB

This is an excellent solution when dealing with very large files or streams. The data can be manipulated as needed between the read and the write, and it is both fast and memory-efficient.

As with buffered readers, we can trade off speed vs memory usage by adjusting the size of the buffer.
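
For instance, the following sketch (separate from the benchmark program) modifies the bytes in the direct buffer between the read and the write; the upper-casing of ASCII letters and the 64 KB buffer size are illustrative assumptions only:

// Sketch only: transform data in a direct buffer between read and write.
// Assumes the same textfile.txt / copyfile.txt pair as the benchmark program.
public void copyUppercased() {
      try (FileChannel inChannel = new FileInputStream("textfile.txt").getChannel();
           FileChannel outChannel = new FileOutputStream("copyfile.txt").getChannel()) {
          ByteBuffer bb = ByteBuffer.allocateDirect(64 * 1024); // larger buffer: faster, more native memory
          while (inChannel.read(bb) != -1) {
              bb.flip();
              for (int i = 0; i < bb.limit(); i++) {
                  byte b = bb.get(i);
                  if (b >= 'a' && b <= 'z') {
                      bb.put(i, (byte) (b - 32)); // upper-case ASCII letters in place
                  }
              }
              outChannel.write(bb);
              bb.clear();
          }
      } catch (IOException e) {
          System.out.println("Error in copyUppercased: " + e);
      }
}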

5. Copying Files

If the data simply needs to be copied from one place to another without being processed or amended, the java.nio.file.Files class has a copy facility. This is highly efficient regarding both speed and memory. It can be useful in situations where we simply want to redirect data without changing it.

The sample method looks like this:

public void copyFile(){
      timeStarted=new Date();
      try {
          Files.copy(Path.of("textfile.txt"),Path.of("copyfile.txt"), 
                              StandardCopyOption.REPLACE_EXISTING);
          }
       catch(IOException e) {
          System.out.println("Error using Files.copy" + e.toString());
          }
// Create statistics      
      statistics("FSD5_dump");
// Release memory

// Ensure garbage is collected before next phase
      System.gc();
      }

The results were as follows:

Time Taken: 7.241 s
Total Memory Usage: 80,644 KB
Heap Usage: 8,192 KB

6. Comparing the Solutions

We can summarize the five solutions as follows:

Solution          Time Taken (s)  Heap Memory (KB)  Total Memory (KB)  Notes
Full File Read    18.261          1,077,248         1,189,113          Suitable for small files only
Scanner           75.61           106,496           182,513            Slow
Buffered Reader   17.274          70,656            145,404            Can trade off speed against memory usage by adjusting buffer size
Direct Buffer     13.207          10,240            82,769             Can trade off speed against memory usage by adjusting buffer size; flexible access to data in buffer
Copy              7.241           8,192             80,644             Fast, but no ability to process or adjust data

Other Solutions

The Java language is continually evolving, and newer versions offer better and faster ways of manipulating large datasets.

Examples include:

  • Stream Gatherers: These simplify complex processing of data streams, while offering considerable performance gains (see the sketch after this list). For more information, see this article: Powerful Data Processing with Stream Gatherers.
  • Compressed Data: The package java.util.zip offers support for working with compressed data formats (also sketched after this list). For more information, see the Oracle documentation.
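
Neither of these is used by the benchmark program, but the following sketches show the general idea. The first batches the lines of a large file into fixed-size windows using Gatherers.windowFixed (a preview feature in JDK 22/23, finalized in JDK 24); the batch size of 1,000 and the file names are illustrative assumptions. The second streams a GZIP-compressed file line by line with java.util.zip.GZIPInputStream, so the uncompressed data never has to be held in memory all at once.

// Sketches only - not part of the benchmark program.
// Requires: import java.io.*; import java.nio.file.*; import java.io.IOException;
//           import java.util.stream.Stream; import java.util.stream.Gatherers;
//           import java.util.zip.GZIPInputStream;

// Process a large file in fixed-size batches of lines (Stream Gatherers, JDK 24+)
public void readInBatches() {
      try (Stream<String> lines = Files.lines(Path.of("textfile.txt"))) {
          lines.gather(Gatherers.windowFixed(1000))   // each batch is a List<String> of up to 1,000 lines
               .forEach(batch -> {
                   // process each batch here
                   System.out.println("Batch of " + batch.size() + " lines");
               });
      } catch (IOException e) {
          System.out.println("Error processing in batches: " + e);
      }
}

// Read a GZIP-compressed file line by line, without decompressing it to disk or memory first
public void readCompressed() {
      try (BufferedReader reader = new BufferedReader(new InputStreamReader(
               new GZIPInputStream(new FileInputStream("textfile.txt.gz"))))) {
          String line;
          while ((line = reader.readLine()) != null) {
              // process each decompressed line here
          }
      } catch (IOException e) {
          System.out.println("Error reading compressed file: " + e);
      }
}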

Conclusion

When working with large data files or streams, it’s essential to choose the right method for the specific task. We need to trade off memory footprint against performance, while keeping our solutions simple and easy to maintain.

Proper planning and a good knowledge of different file processing techniques are invaluable in developing cost-effective solutions.
