Advanced Heap Dump Analysis Techniques


When we confront an OutOfMemoryError or other memory problems, we tend to capture and analyze heap dumps. Analyzing heap dumps is an intimidating task. The challenges start with capturing the dump and continue with gathering supporting artifacts for analysis (such as native memory utilization, GC Log,…), transmitting large heap dump files from your production servers to the local machine, and knowing which aspects of the report to look at. This post shares some advanced heap dump analysis techniques that will help you analyze heap dumps more effectively.

Before you read about the advanced heap dump analysis techniques in this post, we recommend reading this post first: Common Memory Leaks in Java & How to Fix Them?, where we cover basic heap dump analysis techniques.

Video

In this video, our Architect Ram Lakshmanan unpacks advanced strategies for analyzing heap dumps and demonstrates how to query them with OQL, automate investigations using APIs, and trace references with precision. He also shares practical guidance on correlating heap dumps with GC logs, thread dumps, and Native Memory Tracking (NMT) to pinpoint the root cause of memory issues faster.

1. REST API

Heap dump files tend to be large, and they also carry sensitive information. Typically, once they are captured, heap dumps are downloaded from the production server to your local machine if there are no firewall restrictions. If there are firewall restrictions, heap dumps are first copied to some shared storage, and from there they are downloaded to the local machine for analysis. This is a cumbersome and tedious process. Instead, you can use HeapHero’s REST API. Using this API, you can push the heap dump directly from your production environment to the HeapHero server. The REST API creates the heap dump analysis report on the HeapHero server, and you can access it from your web browser. This avoids the tedium and complexity of downloading and analyzing heap dumps on your local machine.
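To make the idea concrete, here is a minimal Python sketch of pushing a heap dump to an analysis server over HTTP. Treat everything specific in it as an assumption: the endpoint URL is a placeholder, and the `apiKey` query parameter and the content type are illustrative. Take the real values from your HeapHero account and its API documentation.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint (hypothetical). Replace it with the URL given in the
# HeapHero API documentation for your account.
ENDPOINT = "https://heaphero.example.com/analyze-hd-api"


def build_upload_request(dump_path, api_key, endpoint=ENDPOINT):
    """Build (but do not send) a POST request carrying the heap dump bytes."""
    # The parameter name "apiKey" is an assumption made for this sketch.
    url = endpoint + "?" + urllib.parse.urlencode({"apiKey": api_key})
    with open(dump_path, "rb") as dump:
        body = dump.read()  # acceptable for a sketch; stream very large dumps
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )


def upload(dump_path, api_key):
    """Send the dump and return the response body as text."""
    request = build_upload_request(dump_path, api_key)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Because the request is built separately from the call that sends it, you can inspect the URL and payload in a test environment before wiring `upload` into a production script.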

2. Downloading from Remote Storage

Fig: HeapHero can download Heap Dump from remote location & do analysis

Sometimes your support team or external customers might have placed heap dumps in shared drives, object storage (like AWS S3, GCP,…) or an HTTPS location. Normally, you would have to download the heap dumps from these remote storage locations to your local machine and then do the analysis. Instead, you can use HeapHero’s remote location feature. In the HeapHero dashboard you can specify the remote storage location of your heap dump file, as shown in the above figure. HeapHero will automatically download the heap dump file from this remote location and do the analysis for you.

3. Masking Sensitive Information

Fig: Heapdump before Sanitization

Fig: Heapdump after Sanitization

A heap dump is basically a snapshot of all the objects in your application’s memory. Our applications tend to process sensitive information such as credit card numbers, Social Security Numbers, VAT numbers and PII data (Email Addresses, Phone Number…). Thus, several enterprises classify heap dumps as confidential data, and organizations that handle heap dumps carry the responsibility of protecting them properly. One important thing to note here is that when engineers analyze heap dumps, they get to see this sensitive information in the reports “as is”. It means they can see the full credit card numbers, SSNs and other sensitive information.

HeapHero provides a powerful Heap Dump Sanitization feature. When you turn ON this feature, all the raw data in the heap dump is replaced with *** in the raw heap dump file itself. So your engineers won’t be able to see sensitive information when analyzing the heap dump. Even if your heap dump ends up in the wrong hands, they won’t be able to do much with it. See the above screenshots to see how sanitized data is reported in the HeapHero reports. It makes the analysis considerably more secure.

4. 360° Analysis

There are 9 types of OutOfMemoryError. Only 2 of them (i.e. ‘java.lang.OutOfMemoryError: Java heap space’, ‘java.lang.OutOfMemoryError: GC Overhead Limit Exceeded’) can be solved through heap dumps. Even to diagnose these 2 types, you will need Garbage Collection logs besides the heap dump. To troubleshoot the remaining 7 types, you need additional troubleshooting artifacts. The table below summarizes which artifacts you need to analyze each type of OutOfMemoryError:

Type of OutOfMemoryError	Required Artifacts
OutOfMemoryError: Java Heap Space	GC Log, Heap Dump
OutOfMemoryError: GC overhead limit exceeded	GC Log, Heap Dump
OutOfMemoryError: unable to create new native thread	Thread Dump
OutOfMemoryError: Metaspace	GC Log, Verbose Class Loading Log
OutOfMemoryError: PermGen Space	GC Log, Verbose Class Loading Log
OutOfMemoryError: Kill process or sacrifice child	dmesg
OutOfMemoryError: Requested array size exceeds VM limit	Application Log
OutOfMemoryError: Direct buffer Memory	Application Log, Native Memory Tracking

Thus, you may want to leverage the yc-360 open-source script. When triggered, this script captures 16 different troubleshooting artifacts from your application, including all the artifacts mentioned in the above table. This comprehensive approach helps you troubleshoot not only all sorts of OutOfMemoryError problems but also all sorts of JVM problems.
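If you triage errors with a script, the table above can be encoded as a simple lookup. The Python sketch below is only an illustration: the function name `artifacts_for` is made up for this post, and matching on lowercase fragments of the error message is a simplification.

```python
# Artifacts per OutOfMemoryError type, copied from the table above.
# Keys are lowercase fragments of the error message (a simplification).
REQUIRED_ARTIFACTS = {
    "java heap space": ["GC Log", "Heap Dump"],
    "gc overhead limit exceeded": ["GC Log", "Heap Dump"],
    "unable to create": ["Thread Dump"],
    "metaspace": ["GC Log", "Verbose Class Loading Log"],
    "permgen space": ["GC Log", "Verbose Class Loading Log"],
    "kill process or sacrifice child": ["dmesg"],
    "requested array size exceeds vm limit": ["Application Log"],
    "direct buffer memory": ["Application Log", "Native Memory Tracking"],
}


def artifacts_for(error_message):
    """Return the artifacts to collect for an OutOfMemoryError message."""
    message = error_message.lower()
    for fragment, artifacts in REQUIRED_ARTIFACTS.items():
        if fragment in message:
            return artifacts
    return []  # unknown type: nothing to suggest


print(artifacts_for("java.lang.OutOfMemoryError: Metaspace"))
```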

5. Shift Left Strategy: CI/CD Integration

Most OutOfMemoryErrors and memory bottlenecks are found only in the production environment; they aren’t uncovered in the pre-prod environments. This happens for the following reasons:

a. Synthetic Data != Production Data: The data sets used in performance labs are often artificial, scrubbed, or scaled-down versions of production data. They miss the unpredictable diversity, skew, and growth patterns of real-world usage, which hide memory bottlenecks.

b. Performance Environment != Production Environment: Pre-prod setups usually run on smaller, less powerful machines, different OS versions or different configurations. Subtle differences in CPU, memory, or JVM settings can mask memory pressure that only surfaces under true production load.

c. Lack of Long Running Tests: Most performance tests are executed for a few hours at best. Memory leaks and allocation stalls that emerge only after a few days or weeks of continuous execution never get a chance to reveal themselves.

d. Absence of Real-World Chaos: Pre-production tests rarely simulate unexpected spikes, network latencies, or concurrent user surges. These chaotic conditions amplify memory contention, which explains why memory problems frequently appear first in production.

Even though these are valid reasons, I would like to make the case that memory problems do happen in our performance lab, but at a small scale. They aren’t big enough to catch our attention because we monitor only macro-metrics such as CPU consumption, memory consumption and response time in our performance lab. Thus, along with these macro-metrics, we should start monitoring memory-related micro-metrics such as:

a. GC Behavior Pattern

b. Object Creation Rate

c. GC Throughput

d. GC Pause Time (Avg & Max)

Garbage Collection heavily influences the application’s performance, so studying the Garbage Collection behavior will help you forecast memory-related bottlenecks.
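Here is a rough Python sketch of how three of these micro-metrics can be derived once GC events have been parsed out of a GC log. The `GcEvent` record and its fields are hypothetical, made up for this illustration, and the object creation rate is an approximation: it assumes everything added to the heap between two GC events is newly allocated.

```python
from dataclasses import dataclass


@dataclass
class GcEvent:
    start_sec: float       # seconds since JVM start when the GC began
    pause_sec: float       # stop-the-world pause duration
    heap_before_mb: float  # heap usage just before the GC
    heap_after_mb: float   # heap usage just after the GC


def micro_metrics(events, elapsed_sec):
    """Compute GC throughput, pause times and object creation rate.

    Needs at least two events that start at different times.
    """
    pauses = [e.pause_sec for e in events]
    total_pause = sum(pauses)
    # Memory allocated between consecutive GCs: heap before this GC minus
    # heap after the previous one.
    allocated_mb = sum(
        cur.heap_before_mb - prev.heap_after_mb
        for prev, cur in zip(events, events[1:])
    )
    window_sec = events[-1].start_sec - events[0].start_sec
    return {
        # Share of time the application did real work instead of GC.
        "gc_throughput_pct": 100.0 * (elapsed_sec - total_pause) / elapsed_sec,
        "avg_pause_sec": total_pause / len(pauses),
        "max_pause_sec": max(pauses),
        "object_creation_rate_mb_per_sec": allocated_mb / window_sec,
    }
```

Tracking these numbers from one performance test run to the next makes a regression visible even when CPU and response time still look normal.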

Fig: Garbage Collection Behavior of a Healthy Application

The above graph shows the GC behavior of a healthy application. You can see a beautiful saw-tooth GC pattern. Notice that when the heap usage reaches ~5.8GB, a ‘Full GC’ event (red triangle) gets triggered. When the ‘Full GC’ event runs, memory utilization drops all the way to the bottom, i.e., ~200MB. Please see the dotted black line in the graph, which connects all the bottom points. This dotted black line is flat (a 0° slope). It indicates that the application is in a healthy state and not suffering from any sort of memory problem.

Fig: Application Suffering from Acute Memory Leak in Performance Lab

Above is the garbage collection behavior of an application that is suffering from an acute memory leak. When an application suffers from this pattern, heap usage will climb up slowly, eventually resulting in OutOfMemoryError.

In the above figure, you can notice that a ‘Full GC’ (red triangle) event gets triggered when heap usage reaches ~8GB. In the graph, you can also observe that the amount of heap that Full GC events could recover declines over time:

When the first Full GC event ran, heap usage dropped to 3.9GB
When the second Full GC event ran, heap usage dropped only to 4.5GB
When the third Full GC event ran, heap usage dropped only to 5GB
When the final Full GC event ran, heap usage dropped only to 6.5GB

Please see the dotted black line in the graph, which connects all the bottom points. You can notice that the black line is rising at about 15°. This indicates that this application is suffering from an acute memory leak. If this application runs for a prolonged period, it will experience OutOfMemoryError. However, in our performance labs, we don’t run the application for a long period.
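The "dotted black line" check can be automated. The Python sketch below fits a least-squares line through the heap usage left after each Full GC and flags a rising floor. The function names and the 0.1 GB-per-Full-GC threshold are assumptions chosen for illustration, not values from any tool.

```python
def floor_slope(post_full_gc_heap_gb):
    """Least-squares slope of the heap floor, in GB per Full GC event."""
    n = len(post_full_gc_heap_gb)
    mean_x = (n - 1) / 2
    mean_y = sum(post_full_gc_heap_gb) / n
    sxy = sum(
        (i - mean_x) * (y - mean_y)
        for i, y in enumerate(post_full_gc_heap_gb)
    )
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


def looks_like_leak(post_full_gc_heap_gb, threshold_gb_per_gc=0.1):
    """Flag a steadily rising floor; needs at least three Full GC events."""
    if len(post_full_gc_heap_gb) < 3:
        return False
    return floor_slope(post_full_gc_heap_gb) > threshold_gb_per_gc


# Heap usage after each Full GC, taken from the leaking application above.
leaking = [3.9, 4.5, 5.0, 6.5]
print(round(floor_slope(leaking), 2), looks_like_leak(leaking))
```

For the four values listed above, the floor rises by roughly 0.83 GB per Full GC event, while a healthy application that always drops back to ~200MB has a slope of about zero.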

When this application is released into production, you will see the below behavior:

Fig: Application Suffering from OutOfMemoryError in Production

In the above graph, towards the right side, you can notice that Full GC events are running continuously, yet heap usage doesn’t drop. It’s a clear indication that the application is suffering from a memory leak. By the time this pattern shows up, customers have already been impacted and it’s too late to catch the problem.

Thus, observing GC behavior in the performance lab will help you catch OutOfMemoryErrors early in the game.

You don’t need to build custom instrumentation to track these micro-metrics. They’re already available in your JVM’s Garbage Collection (GC) logs. By analyzing those logs with a GC log analysis tool like GCeasy, you can automatically extract patterns, spot leaks early, and visualize trends that are easy to miss with manual inspection.
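GC logging has to be switched on with JVM arguments, and the flags differ between Java 8 and Java 9 or later. The Python helper below is a small sketch of how a launch script could pick them. The function name and the default `gc.log` path are made up for this example.

```python
def gc_logging_flags(java_major_version, log_path="gc.log"):
    """Return HotSpot JVM arguments that write GC activity to log_path."""
    if java_major_version >= 9:
        # Unified JVM logging, available from Java 9 onwards.
        return [f"-Xlog:gc*:file={log_path}"]
    # Java 8 and earlier use the older dedicated GC logging flags.
    return [
        "-XX:+PrintGCDetails",
        "-XX:+PrintGCDateStamps",
        f"-Xloggc:{log_path}",
    ]


print(" ".join(gc_logging_flags(17, "/var/log/app/gc.log")))
```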

Besides memory-related micro-metrics, there are a few other micro-metrics that you can track in your performance lab to prevent other performance problems. You can learn about those metrics from this post: 9 Micro-Metrics That Forecast Production Outages in Performance Labs

6. OQL

OQL (Object Query Language) is a specialized query language that is used to extract and analyze objects in a Java heap dump. Similar to how SQL (Structured Query Language) retrieves data from a database, OQL retrieves information about objects, such as their fields, values, references, sizes, and relationships. It’s useful for diagnosing memory leaks, identifying large object graphs, and exploring how objects are interconnected in the JVM. You can learn more about it in this post: Object Query Language (OQL) in Memory Analyzer

Example 1: Find Top Memory Consumers

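If you want to identify the 10 classes that retain the most memory in your heap dump, you can run a query along these lines (OQL dialects differ between tools, so adjust the function names and clauses to the analyzer you use):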

SELECT TOP 10 classof(o), sum(retainedSize(o)) AS retained_size
FROM OBJECTS o
GROUP BY classof(o)
ORDER BY retained_size DESC

What it does:

Groups objects by their class (classof(o)).
Sums the retained size of the objects in each class (sum(retainedSize(o))).
Orders the results and keeps the top 10, to show which classes are holding the most memory.

This gives you a macro-level view of memory usage and quickly highlights potential memory hogs like large collections, caches, or image buffers.

Example 2: Find Oversized Strings

Sometimes, the problem lies not in entire classes but in specific objects. For example, you can search for unusually large strings (e.g., log dumps, XML payloads, or JSON blobs) like this:

SELECT s.toString()
FROM java.lang.String s
WHERE s.value.length > 1000

What it does:

Filters for all java.lang.String objects in the heap.
Uses s.value.length > 1000 to catch strings whose backing array holds more than 1,000 elements.
Returns the actual string content so you can see what’s consuming memory.

This gives you a micro-level view of the memory used by a particular type of object.

Why Do Both Views Matter? Macro queries (like retained size by class) show you the big picture, while micro queries (like filtering oversized strings) help you drill into specific culprits. Together, they give you a well-rounded perspective, making OQL a critical tool in advanced heap dump analysis.

7. Sunburst Chart to visualize Objects in Memory

Fig: Sunburst chart showing visual representation of Objects in the Memory

When dealing with millions of objects, scrolling through plain lists can be overwhelming. That’s where visualization comes in handy. A Sunburst Chart provides an interactive, hierarchical view of memory usage that makes patterns immediately visible.

HeapHero Sunburst chart shows:

Top memory-consuming objects.
Their retained size (in MB/GB) and percentage of memory they occupy.
Interactive navigation through which you can click on an object to drill down into its child objects and see how much space they retain.

This approach turns raw numbers into a visual story, helping you to quickly spot heavy hitters, nested leaks, or unusual memory retention patterns that would otherwise hide in a sea of data.

8. Common Pitfalls to Avoid

Heap dumps are powerful, but missteps during collection or analysis can lead to wasted effort or misleading conclusions. Here are three common pitfalls you want to avoid:

a. Taking Heap Dumps on Live Systems Without Considering Performance Overhead: Capturing a heap dump is a heavy operation: it pauses the JVM, consumes disk space, and can spike CPU or I/O. On a production system handling real traffic, this can degrade performance or even trigger outages. Capture a heap dump only when you absolutely need it. If your application is running in a cluster of JVMs, don’t take heap dumps from all the JVMs; capturing from one JVM instance should be good enough.

b. Misinterpreting Shallow Size vs Retained Size: Shallow size shows the memory used directly by an object, while retained size reflects the memory that would be freed if that object were garbage-collected. Focusing only on shallow size leads to underestimating impact, since a “small” object might be keeping an enormous graph of dependent objects alive. Retained size is usually more valuable when diagnosing leaks. Learn more in this post: SHALLOW HEAP, RETAINED HEAP

c. Looking Only at the “Biggest Objects” Instead of Tracing Reference Chains: It’s tempting to sort objects by size and assume the largest ones are the problem. But leaks are often caused by reference chains—small objects preventing garbage collection of much larger retained sets. Always trace incoming and outgoing references to see why memory isn’t being released, rather than just blaming the largest individual objects.
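Pitfalls (b) and (c) are easier to see on a toy example. The Python sketch below models a tiny heap as a dictionary (the object names and sizes are invented), computes retained size as "everything that becomes unreachable once the object is removed", and traces the reference chain from the GC root to a large object. Real analyzers use far more efficient algorithms; this only illustrates the concepts.

```python
from collections import deque

# Toy heap: object name -> (shallow size in bytes, referenced object names).
HEAP = {
    "root":     (0,         ["registry", "config"]),
    "registry": (24,        ["cache"]),
    "cache":    (48,        ["payload", "shared"]),
    "payload":  (1_000_000, []),
    "config":   (32,        ["shared"]),
    "shared":   (64,        []),
}


def reachable(heap, start, removed=None):
    """Objects reachable from start, pretending `removed` does not exist."""
    seen, queue = set(), deque([start])
    while queue:
        name = queue.popleft()
        if name == removed or name in seen:
            continue
        seen.add(name)
        queue.extend(heap[name][1])
    return seen


def retained_size(heap, root, name):
    """Bytes freed if `name` were garbage-collected."""
    freed = reachable(heap, root) - reachable(heap, root, removed=name)
    return sum(heap[obj][0] for obj in freed)


def path_from_root(heap, root, target):
    """Shortest reference chain from the GC root to target, or None."""
    parents, queue = {root: None}, deque([root])
    while queue:
        name = queue.popleft()
        if name == target:
            chain = []
            while name is not None:
                chain.append(name)
                name = parents[name]
            return chain[::-1]
        for child in heap[name][1]:
            if child not in parents:
                parents[child] = name
                queue.append(child)
    return None


print(HEAP["registry"][0], retained_size(HEAP, "root", "registry"))
print(path_from_root(HEAP, "root", "payload"))
```

In this toy heap, `registry` has a shallow size of only 24 bytes but retains over 1 MB, because it is the only path to `cache` and `payload`. The `shared` object is not counted, since `config` also keeps it alive. Sorting by size alone would point at `payload`; the reference chain shows that `registry` is what prevents it from being collected.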

Conclusion

I hope the advanced techniques shared in this post will help you analyze heap dumps more effectively. Happy Hacking!! Happy Troubleshooting!!
