Streaming Data and Large Files: Strategies to Prevent Heap Issues

Today’s applications deal daily with volumes of data that nobody dreamed of a few years ago. IoT devices, video and audio recorders, websites, smartphones and more are constantly feeding data into vast information lakes.

Communications systems, distributed processing, big data analytics and live feeds are just a few of the applications whose processing requirements are likely to be measured in terabytes.

Although memory capacities are huge compared to ten years ago, they are still finite. Also, in Java, inefficient use of memory can result in overloading the garbage collector, which causes serious performance issues.

It’s therefore vital when dealing with large files or data streams to plan the application carefully, looking at what data is actually needed in memory at any given moment. It’s also important to choose the most efficient data handling techniques to suit the particular task.

In this article, we’ll look at some of the common methods of reading data, and how each compares in terms of speed and memory usage. We’ll also mention some of the more advanced data handling techniques, and provide links to more information.

Dealing With Large Data Volumes: The Trade-offs

In designing any data processing solution, there are trade-offs to be taken into account. When dealing with large data, the factors we need to look at are:

  • Processing Speed;
  • Memory Usage;
  • Simplicity of design.

In general, holding more data in memory at one time can improve performance: large buffers, or caching a large number of records, reduce the amount of I/O needed. This comes at the cost of greater memory usage, and the benefit can be lost if the larger heap forces the garbage collector to work too hard. Using the simplest APIs can make the system more stable and easier to maintain.

We’ll bear these trade-offs in mind when looking at different file and stream processing methods.

Sample Program: Introduction

We’ll be using a sample program to benchmark different types of solutions, and compare results.

First let’s look at the program skeleton to see how it fits together. In subsequent sections, we’ll look at the individual methods, each of which uses a different file processing technique. Finally, we’ll compare the solutions in terms of processing time and memory usage.

Each of the solutions reads a text file approximately 500 MB in size, and writes it out to a second text file. At the end of each step, the program takes a heap dump for analysis, and pauses to allow us to use the jcmd utility to report on native memory usage.

import java.lang.management.ManagementFactory;
import java.util.Date;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.channels.FileChannel;
import java.nio.ByteBuffer;
import java.io.IOException;
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.FileWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;


public class FileSpeedDemo {
    private Date timeStarted=new Date();
    public static void main(String[] args) {
      FileSpeedDemo demo = new FileSpeedDemo();
    }

    public FileSpeedDemo() {
      readFullFile();
      readLineByLine();
      readWithBuffering();
      readWithDirectBuffers();
      copyFile();   
    }
/* 
   ===========================================================
   Individual methods will be inserted here: see next sections
   ===========================================================
*/

// Take stats after each step
// ==========================
public void statistics(String dumpName) {
// Calculate time taken
   long duration = new Date().getTime()-timeStarted.getTime();
   System.out.println("Time taken: "+duration + " ms");
        
// Take heap dump
   boolean liveObjectsOnly = false; // Set to true to dump only live objects
   try {
            MBeanServer mbeanServer = ManagementFactory.getPlatformMBeanServer();
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    mbeanServer,
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            String dumpFile=dumpName+".hprof";
            Files.deleteIfExists(Path.of(dumpFile));
            bean.dumpHeap(dumpFile, liveObjectsOnly);
            System.out.println("Heap dump successfully generated at: " + dumpName);

        } catch (Exception e) {
            System.out.println("Failed to generate heap dump: " + e.getMessage());          
        }
// Pause to allow native memory tracking
    System.out.println
       ("Track native memory using 'jcmd <pid> VM.native_memory summary' then press enter to continue");
    new java.util.Scanner(System.in).nextLine();
    }
}

To track native memory usage, we’ll need to run the program using the following command line switch:

java -XX:NativeMemoryTracking=detail FileSpeedDemo

At the end of each step, when the program pauses, we can obtain detailed information of both native and heap memory usage using the following command:

jcmd <PID> VM.native_memory summary|more

where <PID> is the process ID of the running Java program.

This gives a breakdown of memory usage. Part of its typical output looks like this:

Native Memory Tracking:

(Omitting categories weighting less than 1KB)

Total: reserved=3527921KB, committed=265673KB
-                 Java Heap (reserved=2066432KB, committed=189440KB)
                            (mmap: reserved=2066432KB, committed=189440KB)

-                     Class (reserved=1048749KB, committed=557KB)
                            (classes #1657)

Comparing Commonly Used I/O Techniques

In this section, we’ll use the sample program to illustrate five different file reading techniques, and see how they perform.

1. Reading The Entire File With One Instruction

This is often the simplest way of reading data from a file. We would almost never use it when dealing with large volumes. It is useful, however, for small configuration files that need to be retained in memory and referred to as needed.

This is illustrated by the following method in the sample program:

public void readFullFile() {
// Read the full file, then write to a new file
      timeStarted=new Date();
      List<String> list = new ArrayList();
      try {
          list = Files.readAllLines(Path.of("textfile.txt"));
          Files.write(Path.of("copyfile.txt"),list);
          }
      catch(IOException e) {
          System.out.println("Error in readFullFile method");
          System.out.println(e.toString());
          }
 
// Create statistics      
     statistics("FSD1_dump");

// Release memory
      list=null;

// Ensure garbage is collected before next phase
      System.gc();
      }

The Java class java.nio.file.Files includes a method named readAllLines, which copies the entire contents of the file into a List object.

This has the advantage of being very simple to code, and it can be moderately fast, but it is extremely memory-hungry.

When run, the results were as follows:

Time Taken: 18.261 s
Total Memory Usage: 1,189,113 KB
Heap Usage: 1,077,248 KB

Analyzing the dump with the HeapHero utility shows that the resulting list is enormous:

Fig: HeapHero Shows Very Large ArrayList

2. Reading Line by Line 

Reading and processing the file one line at a time gives huge memory gains. It may be slower, and only one line can be accessed at a time: we can’t refer back to previous lines.

The simplest way of doing this is to use the class java.util.Scanner, which can be used to read from a file, from the network or from the keyboard.

The sample method looks like this:

// ==========================
// Process one line at a time
// ==========================
public void readLineByLine(){
      timeStarted=new Date();
      try {
          Scanner scanner = new Scanner(Path.of("textfile.txt"));
          PrintWriter pw = new PrintWriter(new FileWriter("copyfile.txt"));
// Read and write one line at a time
          while (scanner.hasNextLine())
             pw.println(scanner.nextLine());
// Create statistics      
          statistics("FSD2_dump");
// Release memory
          scanner.close();
          pw.close();
          scanner=null;
          pw=null;
          }
      catch(IOException e) {
          System.out.println("Error copying line by line" + e.toString());
          }

// Ensure garbage is collected before next phase
      System.gc();
      }

The results were as follows:

Time Taken: 75.61 s
Total Memory Usage: 182,513 KB
Heap Usage: 106,496 KB

As we see, the memory savings are huge, but the time taken is much longer. This method is too slow for dealing with very large volumes.

3. Using Buffering

Buffering, where chunks of the file are read from disk into a buffer area and then processed line by line, gives considerable performance gains. This is because data is fetched from disk or the network in large chunks, so far fewer system calls and device reads are needed than when fetching one line at a time.

We can trade off memory vs performance by adjusting the size of the buffers. Large buffers are faster but use more memory.

The sample method looks like this:

public void readWithBuffering() {
      timeStarted=new Date();
      try {
          BufferedReader reader = Files.newBufferedReader(Path.of("textfile.txt"));
          PrintWriter pw = new PrintWriter(new FileWriter("copyfile.txt"));
          String thisLine;
// Loop through lines until readLine returns null at end of file
          while ((thisLine = reader.readLine()) != null) {
              pw.println(thisLine);
              }
// Create statistics      
      statistics("FSD3_dump");
// Release memory
          reader.close();
          pw.close();
          reader=null;
          pw=null;
          }
      catch(IOException e) {
          System.out.println("Error copying with buffer" + e.toString());
          }
         
// Ensure garbage is collected before next phase
      System.gc();
      }

The results were as follows:

Time Taken: 17.264 s
Total Memory Usage: 145,404 KB
Heap Usage: 70,656 KB

This method is adequate for most applications, but can be improved on for very large volumes, as we’ll see.
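
To illustrate the buffer-size trade-off mentioned above, here is a minimal sketch (separate from the benchmark program) that passes an explicit buffer size to BufferedReader; the 1 MB figure and the file name are assumptions for illustration, to be tuned against your own measurements:

// Sketch only: a BufferedReader with an explicit buffer size (the JDK default is 8 KB).
// Requires: import java.io.BufferedReader; import java.io.FileReader; import java.io.IOException;
public void readWithLargeBuffer() {
      int bufferSize = 1024 * 1024;   // 1 MB - illustrative value only
      try (BufferedReader reader = new BufferedReader(new FileReader("textfile.txt"), bufferSize)) {
          String line;
          while ((line = reader.readLine()) != null) {
              // process each line here
          }
      } catch (IOException e) {
          System.out.println("Error reading with large buffer: " + e);
      }
}

Larger buffers mean fewer underlying reads, at the cost of a bigger heap footprint per open reader.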

4. File Channels and Direct Buffers

File channels are a flexible and efficient way of dealing with I/O, and have the advantage that they can be used with byte buffers. Byte buffers may be created as direct buffers, which are allocated in native memory outside the Java heap. They save considerable time, since the operating system can perform I/O directly on that memory, rather than first copying data into and out of the Java heap.

By using file channels, we can read, manipulate and write data in the buffer. This method is efficient regarding both performance and memory usage.

The sample method looks like this:

public void readWithDirectBuffers(){
      timeStarted=new Date();
      try {  
          FileChannel inChannel=new FileInputStream("textfile.txt").getChannel();
          FileChannel outChannel=new FileOutputStream("copyfile.txt").getChannel();
          ByteBuffer bb = ByteBuffer.allocateDirect(8192);
          while(inChannel.read(bb) != -1) {
               bb.flip();
               outChannel.write(bb);
               bb.clear();    
          }

// Create statistics      
         statistics("FSD4_dump");
// Release memory
      inChannel.close();
      outChannel.close();
      inChannel=null;
      outChannel=null;
      bb=null;
      }
      catch(IOException e) {
          System.out.println("Error copying with direct buffer" + e.toString());
          }
// Ensure garbage is collected before next phase
      System.gc();
      }

The results were as follows:

Time Taken: 13.207 s
Total Memory Usage: 82,769 KB
Heap Usage: 10,240 KB

This is an excellent solution when dealing with very large files or streams. The data can be manipulated as needed between the read and the write, and it is both fast and memory-efficient.

As with buffered readers, we can trade off speed vs memory usage by adjusting the size of the buffer.
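
For instance, the following sketch (separate from the benchmark program) modifies the bytes in the direct buffer between the read and the write; the upper-casing of ASCII letters and the 64 KB buffer size are illustrative assumptions only:

// Sketch only: transform data in a direct buffer between read and write.
// Assumes the same textfile.txt / copyfile.txt pair as the benchmark program.
public void copyUppercased() {
      try (FileChannel inChannel = new FileInputStream("textfile.txt").getChannel();
           FileChannel outChannel = new FileOutputStream("copyfile.txt").getChannel()) {
          ByteBuffer bb = ByteBuffer.allocateDirect(64 * 1024); // larger buffer: faster, more native memory
          while (inChannel.read(bb) != -1) {
              bb.flip();
              for (int i = 0; i < bb.limit(); i++) {
                  byte b = bb.get(i);
                  if (b >= 'a' && b <= 'z') {
                      bb.put(i, (byte) (b - 32)); // upper-case ASCII letters in place
                  }
              }
              outChannel.write(bb);
              bb.clear();
          }
      } catch (IOException e) {
          System.out.println("Error in copyUppercased: " + e);
      }
}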

5. Copying Files

If the data simply needs to be copied from one place to another without being processed or amended, the java.nio.file.Files class has a copy facility. This is highly efficient regarding both speed and memory. It can be useful in situations where we simply want to redirect data without changing it.

The sample method looks like this:

public void copyFile(){
      timeStarted=new Date();
      try {
          Files.copy(Path.of("textfile.txt"),Path.of("copyfile.txt"), 
                              StandardCopyOption.REPLACE_EXISTING);
          }
       catch(IOException e) {
          System.out.println("Error using Files.copy" + e.toString());
          }
// Create statistics      
      statistics("FSD5_dump");
// Release memory

// Ensure garbage is collected before next phase
      System.gc();
      }

The results were as follows:

Time Taken: 7.241 s
Total Memory Usage: 80,644 KB
Heap Usage: 8,192 KB

6. Comparing the Solutions

We can summarize the five solutions as follows:

Solution          Time Taken (s)  Heap Memory (KB)  Total Memory (KB)  Notes
Full File Read    18.261          1,077,248         1,189,113          Suitable for small files only
Scanner           75.61           106,496           182,513            Slow
Buffered Reader   17.274          70,656            145,404            Can trade off speed against memory usage by adjusting buffer size
Direct Buffer     13.207          10,240            82,769             Can trade off speed against memory usage by adjusting buffer size; flexible access to data in buffer
Copy              7.241           8,192             80,644             Fast, but no ability to process or adjust data

Other Solutions

The Java language is continually evolving, and newer versions offer better and faster ways of manipulating large datasets.

Examples include:

  • Stream Gatherers: These simplify complex processing of data streams, while offering considerable performance gains (see the sketch after this list). For more information, see this article: Powerful Data Processing with Stream Gatherers.
  • Compressed Data: The package java.util.zip offers support for working with compressed data formats (also sketched after this list). For more information, see the Oracle documentation.
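
Neither of these is used by the benchmark program, but the following sketches show the general idea. The first batches the lines of a large file into fixed-size windows using Gatherers.windowFixed (a preview feature in JDK 22/23, finalized in JDK 24); the batch size of 1,000 and the file names are illustrative assumptions. The second streams a GZIP-compressed file line by line with java.util.zip.GZIPInputStream, so the uncompressed data never has to be held in memory all at once.

// Sketches only - not part of the benchmark program.
// Requires: import java.io.*; import java.nio.file.*; import java.io.IOException;
//           import java.util.stream.Stream; import java.util.stream.Gatherers;
//           import java.util.zip.GZIPInputStream;

// Process a large file in fixed-size batches of lines (Stream Gatherers, JDK 24+)
public void readInBatches() {
      try (Stream<String> lines = Files.lines(Path.of("textfile.txt"))) {
          lines.gather(Gatherers.windowFixed(1000))   // each batch is a List<String> of up to 1,000 lines
               .forEach(batch -> {
                   // process each batch here
                   System.out.println("Batch of " + batch.size() + " lines");
               });
      } catch (IOException e) {
          System.out.println("Error processing in batches: " + e);
      }
}

// Read a GZIP-compressed file line by line, without decompressing it to disk or memory first
public void readCompressed() {
      try (BufferedReader reader = new BufferedReader(new InputStreamReader(
               new GZIPInputStream(new FileInputStream("textfile.txt.gz"))))) {
          String line;
          while ((line = reader.readLine()) != null) {
              // process each decompressed line here
          }
      } catch (IOException e) {
          System.out.println("Error reading compressed file: " + e);
      }
}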

Conclusion

When working with large data files or streams, it’s essential to choose the right method for the specific task. We need to trade off memory footprint against performance, while keeping our solutions simple and easy to maintain.

Proper planning and a good knowledge of different file processing techniques are invaluable in developing cost-effective solutions.
