Debugging OutOfMemoryError in a Microservices Architecture: Unique Challenges and Container-Specific Solutions

With the modern trend towards cloud computing, microservices running in easy-to-deploy containers are becoming more and more widespread. Services can be packaged and deployed painlessly using tools such as Docker, and orchestrated with platforms such as Kubernetes.

Containers are great – until they go wrong. In particular, OutOfMemoryErrors in a microservices architecture are common, and can be challenging to troubleshoot.

This article looks at the special difficulties of troubleshooting when a Java application running in a container has memory issues, and suggests some solutions to make debugging simpler.

Challenges When Encountering an OutOfMemoryError in a Microservices Architecture

OutOfMemoryErrors in Java are not usually difficult to diagnose. This article, Common Memory Errors in Java and How to Fix Them, is a good guide to troubleshooting most OutOfMemoryErrors. I’d also recommend reading Types of OutOfMemoryErrors for a good understanding of Java memory issues.

However, when working with containers, there are a few special challenges we’re likely to encounter when troubleshooting.

  • The container often automatically restarts itself when an error occurs. Since not all storage in the container is persistent across reboots, this may result in the loss of critical information that could assist in troubleshooting.
  • Containers are usually designed to use the least possible resources, so it’s unlikely that troubleshooting tools are installed within the container.
  • Earlier versions of Java weren’t designed to work with containers. If no sizes are configured for the various memory areas, the JVM defaults to a percentage of the available memory. In a container, however, the available memory does not depend on the amount of physical memory in the machine, but on the limits set for the cgroup in which the container runs. Prior to Java 10, the JVM didn’t take this into account, so the container would frequently run out of memory unless the JVM was carefully configured. (A quick way to check what the JVM actually sees is shown after this list.)
  • If the container runs out of memory, it’s likely to silently kill the application. The application log won’t show an error, although the kernel log will record the event.
  • Microservices are often business-critical, so extended downtime while we debug is rarely acceptable.
  • Issues may not affect every instance of the container, making them harder to debug. This is because instances may be running in different environments, and may also be contending for resources with other containers.
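
As a quick illustration of the cgroup point above, running the JVM with -XX:+PrintFlagsFinal inside a memory-limited container shows the heap size it has actually derived. The image name and memory limit here are just examples.

    # Start a 512 MB-limited container and print the derived max heap size.
    # eclipse-temurin:17 is used as an example image; any JDK image works.
    docker run --rm -m 512m eclipse-temurin:17 sh -c \
      "java -XX:+PrintFlagsFinal -version | grep -i maxheapsize"

On a container-aware JVM, the reported MaxHeapSize will be a fraction of the 512 MB limit, rather than of the host’s physical memory.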

Suggested Solutions

Let’s look at some solutions to these issues.

1. Ensure Log Files and Diagnostic Information Are Placed on Persistent Storage

Disk storage within a container is not, by default, retained between restarts. We need to ensure that troubleshooting artifacts such as GC logs, application logs and kernel logs are directed to persistent storage so they will still be available after the system has crashed. If you’re not familiar with creating persistent storage, the Docker and Kubernetes documentation on volumes is a good place to start.
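
As a minimal sketch, assuming Docker and Java 9+ unified logging (the image name and paths are illustrative): mount a host directory as a volume and point the GC log at it, so the log survives a container restart.

    # Mount a host directory into the container and write GC logs there.
    docker run -v /var/log/myapp:/logs my-java-app \
      java "-Xlog:gc*:file=/logs/gc.log" -jar /app/app.jar

The same approach applies to application logs and heap dumps; in Kubernetes, a volume backed by a persistentVolumeClaim serves the same purpose.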

2. Choose the Right JVM Version and Configure it Correctly

JVM versions prior to Java 10 may not work well with containers, so it makes sense to use a Java version that was designed with containers in mind. A recent release of Java is highly recommended for microservices.

Java 10 introduced full container awareness via the -XX:+UseContainerSupport option, which is enabled by default from that release onwards (and was backported to Java 8u191). It isn’t available in earlier versions.

Configure a sensible size for the heap rather than relying on system defaults. For hints on heap sizing, read this article: Sizing Your Heap Correctly. In containers, however, it’s better to specify the heap size as a percentage of available memory rather than as a fixed value, so the same image remains flexible across different environments. The JVM switch for this is -XX:MaxRAMPercentage=<percentage>.
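
A minimal sketch of a percentage-based heap configuration (the values are illustrative, not recommendations):

    # Let the heap grow to at most 75% of the container's memory limit.
    java -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -jar app.jar

Leaving headroom below 100% is important: the metaspace, thread stacks and direct buffers all live outside the heap but still count towards the container’s limit.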

For applications that make heavy use of the metaspace or direct buffers, we should also configure maximum sizes for these areas, or they may grow to the point where the container runs short of memory and kills the application.
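
For example, assuming the same illustrative values as above, we might cap these areas explicitly:

    # Cap the non-heap memory areas so they can't grow unbounded.
    java -XX:MaxRAMPercentage=75.0 \
         -XX:MaxMetaspaceSize=256m \
         -XX:MaxDirectMemorySize=128m \
         -jar app.jar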

3. Use REST APIs for Diagnostics

These can be extremely helpful for sending diagnostic data to monitoring services without placing an undue load on the container. Tools such as HeapHero, fastThread and GCeasy all provide REST APIs for incorporating diagnostics into applications. This means that performance can be monitored proactively, and diagnostic data will be available after a crash.
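
As a hedged sketch of what this looks like in practice, here is a GC log upload to GCeasy’s REST API with curl. The endpoint shown is as documented at the time of writing, so check the tool’s API documentation before relying on it; the API key and log path are placeholders.

    # POST a GC log to GCeasy for automated analysis.
    curl -X POST "https://api.gceasy.io/analyzeGC?apiKey=YOUR_API_KEY" \
         --data-binary @/logs/gc.log

The response is a machine-readable report that can be stored or fed into alerting, so the analysis survives even if the container itself is gone.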

4. Configure an OnOutOfMemoryError Action in the JVM

The JVM argument -XX:OnOutOfMemoryError=<command> allows us to specify an action to take if the JVM runs out of memory. This could be an operating system command, such as a request to take a heap dump, or a script that initiates several actions. Remember to direct any heap dump to persistent storage.
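
A minimal sketch combining this with a heap dump on OOM (the paths and script name are hypothetical):

    # Dump the heap to persistent storage and run a diagnostic script on OOM.
    # %p is expanded by the JVM to the process id.
    java -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/logs/dumps \
         -XX:OnOutOfMemoryError="/opt/scripts/collect-diagnostics.sh %p" \
         -jar app.jar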

Using yCrash to Simplify Diagnostic Processes

Let’s now look at a troubleshooting tool that works really well with microservices in containers.

The basis for fast resolution of production issues is comprehensive diagnostic information. We need to know as much as we can, not only about the application’s performance, but also about its operating environment.

This is where yCrash comes in. It gathers 360° data, covering everything we need to know in order to troubleshoot both the application and its container. The image below shows the artifacts that make up these comprehensive diagnostics.

Fig: Data Gathered by yCrash

Among other things, the data includes:

  • GC monitoring
  • Thread analysis
  • Heap and memory usage
  • Application logs
  • Processes
  • Networking statistics
  • Resources
  • Disk usage
  • Kernel log

yCrash can run on almost any platform, including popular containers.

The software can be run in two ways.

1. The yCrash Script

This is a free, open-source script available from yCrash’s GitHub. When triggered, it collects 360° data into a zip file, which can either be uploaded to the yCrash server or used as input to other troubleshooting tools. It can be configured to run automatically whenever an OutOfMemoryError is encountered, using the JVM command-line argument -XX:OnOutOfMemoryError described above. It can also be invoked manually whenever the system is experiencing performance problems.
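
A hedged sketch of that configuration, assuming the script has been installed at a path of our choosing (the path and script name below are hypothetical; see yCrash’s GitHub for the actual script):

    # Run the yCrash data-capture script automatically on OutOfMemoryError.
    java -XX:OnOutOfMemoryError="/opt/ycrash/yc-data-script.sh" -jar app.jar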

2. The yCrash Server and Agent

This is a powerful, lightweight tool for constant system monitoring and performance analysis. The yCrash agent gathers essential diagnostics and transmits them to the yCrash server, which can either be running in the cloud or on a local machine. If the server detects performance issues or impending crashes, it gathers 360° data and raises an alert.

The diagram below illustrates the server/agent architecture.

Fig: yCrash Server/Agent Architecture

The yCrash agent has very little overhead. It gathers micrometrics that can be used to forecast outages before they happen, allowing us to take action to prevent production problems. Most diagnostic tools focus on macro metrics such as CPU usage, memory usage and response times; unfortunately, by the time these are affected by an issue, the application is already in trouble and user experience is already suffering.

yCrash concentrates on micrometrics such as GC throughput, GC pause time, thread patterns and states, and thread-level CPU time. It uses machine learning to recognize patterns, and proactively raises the alarm before performance is degraded or the system crashes. This is ideal for microservices running in containers, because it can make diagnostic information available so we can solve problems while the system is still up and running.

The agent can run in the container, as a sidecar, or on the host machine; for Docker, see the yCrash documentation. For Kubernetes, the yCrash script can be configured as a preStop hook to gather diagnostics before the pod is recycled, as in the sketch below.
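
A minimal sketch of such a hook, assuming the script is available inside the container image (the names and paths are illustrative):

    # Pod spec fragment: run the yCrash script before the container stops.
    apiVersion: v1
    kind: Pod
    metadata:
      name: my-service
    spec:
      containers:
      - name: app
        image: my-java-app:latest
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "/opt/ycrash/yc-data-script.sh"]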

Conclusion

An OutOfMemoryError in a microservices architecture need not be a show-stopper. With sensible configuration, and by ensuring diagnostic information is available in the event of a crash, we can quickly get the system back up and running.

With tools such as yCrash, we can also monitor the system to prevent outages and performance issues.
