Distributed Tracing Context Leaks using OpenTelemetry

Observability is the ability to understand what is happening inside a software system by analysing the telemetry it emits, so that problems can be caught before they cause production outages. The observability stack is essential for troubleshooting performance issues such as memory leaks, increased thread usage and high CPU consumption. In the Java ecosystem, Micrometer combined with Spring Boot Actuator is a common way of collecting metrics and publishing them to a time-series database such as Prometheus, or of generating traces with Zipkin via Spring Cloud. Logs of production Java applications can be shipped into an ELK pipeline via tools like Filebeat and viewed on Kibana dashboards. Many organisations already have these systems in place. So the question that may now come to mind is: if all of this is already solved, why do we need OpenTelemetry at all? What is OpenTelemetry, and what problem does it solve? We will answer these questions in this article. We will also take a system of three Java microservices and configure distributed tracing in it using OpenTelemetry. Before we jump into the technical concepts, let us agree on the meaning of a few terms.

The 3 pillars of Observability

There are three important terms we need to understand before we jump into OpenTelemetry.

Metrics 

Metrics are numeric measurements collected at regular intervals from a software component, e.g. application throughput or average response time.

Traces

A trace is the end-to-end journey of a single request as it travels through a distributed system or a network of microservices. Traces help diagnose performance issues by narrowing down to the root cause of a bottleneck.

Logs

Logs are timestamped records of events that happened in a software system. A log is like a diary maintained by a software component.

What Is OpenTelemetry in Java?

OpenTelemetry is an open-source, vendor-neutral observability framework that provides standard APIs, SDKs, and tools to generate, collect, and export telemetry data (logs, metrics, and traces). Because it is vendor- and tool-agnostic, it can be used with a broad variety of observability backends, including open-source tools like Jaeger and Prometheus as well as commercial offerings; OpenTelemetry itself is not an observability backend. In modern cloud-native architectures, observability is no longer optional; it is a fundamental requirement. You want to understand what your application is doing via metrics, how requests are flowing through it via traces, and what it is saying via logs. The OpenTelemetry project, often abbreviated OTel, is backed by the Cloud Native Computing Foundation and offers an API, an SDK, a standard wire protocol called OTLP for exporting data, and a pluggable architecture (including the OpenTelemetry Collector) for handling ingestion, processing, and export to backends. Instrumented projects use the API to emit observability data; the SDK, which implements that API, is used to configure how data is collected and exported. The Spring ecosystem has strong observability support via Micrometer, and combining Spring Boot with OpenTelemetry is a powerful way to cover all observability signals (metrics, traces, and logs). The key enabler here is the OTLP protocol rather than any specific library.

OpenTelemetry Architecture

A major goal of OpenTelemetry architecture is to enable easy instrumentation of your applications and systems, regardless of the programming language, infrastructure, and runtime environments used. The backend (storage) and the frontend (visualization) of telemetry data are intentionally left to other tools.

Fig: OpenTelemetry architecture

The OpenTelemetry Collector is the heart of the OpenTelemetry architecture. The key aspect of this design is that there is no vendor lock-in: you can plug in and swap any backend of the stack, and the same application code can connect to any observability backend.

Understanding Distributed Tracing and Context Propagation

Tracing is the propagation of a user request through your application. When the request travels across multiple software components or microservices, it is called distributed tracing. Traces give us the big picture of what happens when a request is made to an application. Whether your application is a monolith with a single database or a sophisticated mesh of services, traces are essential to understanding the full "path" a request takes. A span is a small unit of work, such as a single database query or HTTP call; a trace is a collection of spans, related to each other through parent-child links. Traces are stored in a trace backend such as Grafana Tempo, and the data can then be visualized effectively on a dashboard like Grafana.
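The trace/span relationship can be sketched in plain Java. This is a conceptual model only, not the OpenTelemetry API: every span carries the trace ID of the request it belongs to, its own span ID, and the ID of its parent span (null for the root). The IDs and operation names below are illustrative.

```java
import java.util.List;

// Conceptual sketch of the trace/span data model (NOT the OpenTelemetry API).
public class TraceModel {
    record Span(String traceId, String spanId, String parentSpanId, String operation) {}

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736"; // shared by all spans of one request
        Span root  = new Span(traceId, "span-1", null,     "GET /api/orders");       // entry point
        Span http  = new Span(traceId, "span-2", "span-1", "HTTP call to service-b"); // child of root
        Span query = new Span(traceId, "span-3", "span-2", "SELECT from orders");     // child of the HTTP span

        // A trace is the collection of spans sharing one trace ID;
        // the parent links form a tree rooted at the entry-point span.
        List<Span> trace = List.of(root, http, query);
        trace.forEach(s -> System.out.println(s.operation() + " parent=" + s.parentSpanId()));
    }
}
```

Walking the parent links from any span back to the root is exactly how a tracing UI reconstructs the request tree you see on a dashboard.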

Fig: Distributed tracing in action on Grafana UI

Why Distributed Tracing is not enough

Although distributed tracing gives visibility into how a request flows through different Java microservices, it alone cannot provide complete observability of a software stack. Traces show the propagation and latency of individual requests, but they don’t capture the full operational picture.

For example, tracing can reveal that a request is slow in a particular microservice, but it may not explain why the slowdown occurred. To understand the root cause, engineers often need metrics (such as CPU usage, memory consumption, request rate, or error counts) and logs (detailed event records and error messages).

What is Instrumentation

Instrumentation is the process of emitting telemetry from a business Java application. There are two broad categories of instrumentation:

Automatic / Zero-code

As the name suggests, with this instrumentation type you do not add any code to src/main/java of your projects; you make no code change at all. Instead, a specialized agent runs inside your application's JVM to emit OpenTelemetry data. The agent is a jar file developed and maintained by the OpenTelemetry team and is attached to your Java application via the -javaagent flag; it can be downloaded from the opentelemetry-java-instrumentation releases page. This instrumentation is hassle-free and good for quick setup and instant visibility: the application emits OpenTelemetry data for almost all popular standard frameworks (HTTP, JDBC, gRPC, messaging, etc.). Automatic instrumentation is best suited for teams new to observability who want OpenTelemetry data without much effort.
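Attaching the agent is a single JVM flag. A typical local invocation might look like the following; the jar names and the collector endpoint are illustrative placeholders, while `OTEL_SERVICE_NAME` and `OTEL_EXPORTER_OTLP_ENDPOINT` are standard OpenTelemetry environment variables.

```shell
# Name under which this service's telemetry will be reported (placeholder value)
export OTEL_SERVICE_NAME=service-a
# Where the agent ships OTLP data, e.g. a local Collector (placeholder endpoint)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
# Attach the OpenTelemetry Java agent to an otherwise unmodified Spring Boot jar
java -javaagent:opentelemetry-javaagent.jar -jar app.jar
```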

Manual / Code-based

This is an instrumentation type where you add code to your projects to emit custom or business-specific telemetry. It is good for capturing business-specific operations (e.g., "available inventory" or "number of items sold"), helps in debugging complex issues beyond what auto-instrumentation covers, and allows fine-grained control over what gets observed.

Setting up the OpenTelemetry Observability Stack

We can boot up the OpenTelemetry observability stack locally using a single Docker Compose file. Based on the architecture discussed above, we will start six containers in total: three business Java applications and three OpenTelemetry stack components.

services:
  service-a:
    image: javalanes/service-a
    container_name: service-a
    volumes:
      - ./docker-volume/otel:/otel
    command: >
      java
      -javaagent:/otel/opentelemetry-javaagent.jar
      -Dotel.javaagent.configuration-file=/otel/opentelemetry-config.properties
      -Dotel.service.name=service-a
      -jar /app/app.jar
    depends_on:
      - otel-collector
    environment:
      "service-b.url": "http://service-b:8080/api/abc/"
      "service-c.url": "http://service-c:8080/api/xyz"
    ports:
      - "8080:8080"
  service-b:
    image: javalanes/service-b
    container_name: service-b
    volumes:
      - ./docker-volume/otel:/otel
    command: >
      java
      -javaagent:/otel/opentelemetry-javaagent.jar
      -Dotel.javaagent.configuration-file=/otel/opentelemetry-config.properties
      -Dotel.service.name=service-b
      -jar /app/app.jar
    depends_on:
      - otel-collector
  service-c:
    image: javalanes/service-c
    container_name: service-c
    volumes:
      - ./docker-volume/otel:/otel
    command: >
      java
      -javaagent:/otel/opentelemetry-javaagent.jar
      -Dotel.javaagent.configuration-file=/otel/opentelemetry-config.properties
      -Dotel.service.name=service-c
      -jar /app/app.jar
    depends_on:
      - otel-collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.133.0
    container_name: otel-collector
    command: ["--config=/etc/otel/collector-config.yml"]
    volumes:
      - ./docker-volume/collector-config.yaml:/etc/otel/collector-config.yml
  tempo:
    image: grafana/tempo:2.8.2
    container_name: tempo
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./docker-volume/tempo.yaml:/etc/tempo.yaml
    depends_on:
      - otel-collector
  grafana:
    image: grafana/grafana:12.0
    container_name: grafana
    volumes:
      - ./docker-volume/grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasource.yaml
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - tempo

How OpenTelemetry Context Propagation Can Cause Memory Leaks

Parameters like the trace ID and span ID are generated at the appropriate times by the OpenTelemetry agent attached to our services. Propagation is the mechanism that moves context between services and processes: it serializes and deserializes the context object and provides the relevant information to be propagated from one service to another. Behind the scenes there is a Propagators API, which allows this data to be transmitted from one service to another using HTTP headers. OpenTelemetry maintains multiple propagators; the default one uses the headers specified by the W3C TraceContext specification.
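The default W3C TraceContext propagator carries the context in a `traceparent` HTTP header with the shape `version-traceId-parentSpanId-flags`: version "00", a 32-hex-character trace ID, a 16-hex-character parent span ID, and trace flags ("01" means sampled). A minimal parsing sketch in plain Java (the header value below is the example from the W3C specification):

```java
// Sketch of parsing a W3C `traceparent` header: 00-<traceId>-<spanId>-<flags>
public class TraceParent {
    record Context(String traceId, String spanId, boolean sampled) {}

    static Context parse(String traceparent) {
        String[] parts = traceparent.split("-");
        // Only version "00" with exactly four fields is handled in this sketch.
        if (parts.length != 4 || !parts[0].equals("00")) {
            throw new IllegalArgumentException("unsupported traceparent: " + traceparent);
        }
        return new Context(parts[1], parts[2], parts[3].equals("01"));
    }

    public static void main(String[] args) {
        Context ctx = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println("traceId=" + ctx.traceId() + " sampled=" + ctx.sampled());
    }
}
```

The receiving service continues the same trace by reusing the extracted trace ID and treating the incoming span ID as the parent of its own spans.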

Custom Context Propagation

In most cases, instrumentation libraries or native library instrumentation will handle context propagation for you. In some cases no such support is available and you have to implement it yourself. To do so, you leverage the previously mentioned Propagators API.

On the sender's side, the context is injected into a carrier, for example the headers of an HTTP request. On the receiving side, the context is extracted from the carrier; again, in the case of HTTP, it is retrieved from the headers. This is the most common and standard approach.
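The inject/extract pattern can be sketched without any library. In OpenTelemetry's real Propagators API the carrier access is abstracted behind `TextMapSetter` and `TextMapGetter` interfaces; here, as a simplification, the carrier is just a header map and the context is the raw `traceparent` string.

```java
import java.util.HashMap;
import java.util.Map;

// Library-free sketch of the inject/extract pattern. OpenTelemetry's
// Propagators API generalises this; the carrier here is a plain header map.
public class PropagationSketch {
    static final String TRACEPARENT = "traceparent";

    // Sender side: inject the current context into the outgoing carrier.
    static void inject(String traceContext, Map<String, String> headers) {
        headers.put(TRACEPARENT, traceContext);
    }

    // Receiver side: extract the context from the incoming carrier.
    static String extract(Map<String, String> headers) {
        return headers.get(TRACEPARENT);
    }

    public static void main(String[] args) {
        Map<String, String> outgoingHeaders = new HashMap<>();
        inject("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01", outgoingHeaders);
        // ...the HTTP client sends outgoingHeaders; the downstream server receives them...
        String downstream = extract(outgoingHeaders);
        System.out.println("propagated context: " + downstream);
    }
}
```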

What Is MDC (Mapped Diagnostic Context)?

MDC is a framework-independent interface that allows key-value data to be stored per thread; it must be cleared at the end of each request. It is typically used in logging frameworks: SLF4J offers the interface, while implementations such as Logback and Log4j2 provide the backing store. MDC is generally used by framework and library authors, but application developers in enterprise Java applications also commonly use it to carry user session information, order IDs, user IDs and the like.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/leak")
public class LeakController {

    private static final Logger log = LoggerFactory.getLogger(LeakController.class);

    @GetMapping
    public void leak() {
        log.info("Processing request");
        String largeData = "X".repeat(5 * 1024 * 1024); // creates a string of size 5 MB
        MDC.put("leak", largeData);
        // Simulate some processing
        try {
            Thread.sleep(100);
        } catch (InterruptedException ignored) {
        }
        // intentionally not clearing MDC
    }
}

What is an MDC Leak?

An MDC memory leak occurs when ThreadLocal values are not cleared after a request finishes, causing them to persist and grow over time. Since MDC uses ThreadLocal, the data remains attached to the thread. In high-throughput applications, custom thread pools are common: thread pooling avoids the repeated creation of threads, since creating a thread is fairly expensive. Embedded Tomcat, for example, defaults to 200 threads in its pool. In an enterprise Java application built with Spring Boot, threads are therefore reused, so old data can remain attached when a new request arrives; this stale data can cause business issues as well as memory and performance bottlenecks. Such a service is also long-lived, running continuously for months or even years. If MDC values are added but never removed, the long-lived pooled threads accumulate stale context and eventually cause a memory leak. Other problems caused by MDC leaks include incorrect log correlation (mixed request IDs), increased memory usage over time, hard-to-debug production issues, and potential data leakage between concurrent requests.
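The mechanism can be reproduced with nothing but the JDK. The sketch below uses an MDC-like ThreadLocal map (not SLF4J's MDC) and a single-thread pool standing in for Tomcat's worker pool: the first "request" stores data and never clears it, and the second "request", served by the same reused thread, still sees the stale value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Stdlib-only sketch of the MDC leak mechanism (not SLF4J's MDC class).
public class MdcLeakSketch {
    static final ThreadLocal<Map<String, String>> MDC =
            ThreadLocal.withInitial(HashMap::new);

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one reused worker thread

        // Request 1: puts context into the thread-local map, never cleans up.
        pool.submit(() -> MDC.get().put("userId", "alice")).get();

        // Request 2, on the SAME pooled thread, sees request 1's stale data.
        String leaked = pool.submit(() -> MDC.get().get("userId")).get();
        System.out.println("request 2 sees stale userId: " + leaked);

        // The fix: clear the map at the end of every request, e.g. in a
        // finally block or a servlet filter, while still on the worker thread.
        pool.submit(() -> MDC.remove()).get();
        pool.shutdown();
    }
}
```

With Tomcat's 200 pooled threads each pinning a 5 MB string, as in the controller above, the retained heap grows by roughly a gigabyte, which is exactly what the heap dump analysis later in this article shows.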

How to Detect Context Leaks

  • Traces showing unrelated requests linked together
  • Same trace ID appearing across different users
  • Spans continuing longer than expected
  • Logs showing mismatched request IDs

In the Java program above, the application spins up on the embedded Tomcat server, whose default thread pool size is 200 (although it can be tuned for high throughput).

We can simulate heavy traffic hitting our API endpoint, and hence trigger the context memory leak, using the script given below:

while true; do
  curl http://localhost:8080/leak &
done

As soon as the bash snippet starts firing requests, we wait for a few seconds and then trigger heap dump generation using a command-line tool:

jcmd <process id> GC.heap_dump heap.hprof

To find the process ID, we can simply scan the first lines of our Spring Boot app logs:

:: Spring Boot ::               (v3.5.10)
2026-03-02T18:51:28.941+05:30  INFO 42223 --- [threads]
[           main] com.javalanes.threads.ThreadsApplication     :
Starting ThreadsApplication using Java 25 with PID 42223
(/Users/javalanes/Downloads/Tech Projects/threads/target/classes
started by javalanes in /Users/javalanes/Downloads/Tech Projects/threads)
2026-03-02T18:51:28.942+05:30  INFO 42223 --- [threads]
[           main] com.javalanes.threads.ThreadsApplication     :
No active profile set, falling back to 1 default profile: "default"

So in our case the command would be:

jcmd 42223 GC.heap_dump heap.hprof

We can then upload this heap dump to tools like HeapHero and gain insights about the context memory leak.

Fig: HeapHero showing Tomcat threads having heap size of 5 MB each

Fig: HeapHero showing String and byte array as unreachable dominator objects

Fig: HeapHero histogram showing large number of strings consume a large heap size

Fig: Thread information of tomcat thread pool pointing to large retained heap size

Open the thread report in fastThread and you will see it highlighting the repeat() method in the stack trace, which hints at the source of the leak in the code.

Fig: fastThread showing Tomcat threads pointing to the root of problem

Conclusion

In this article we introduced the three pillars of observability. We discussed what OpenTelemetry is and what its advantages are. We learnt about the two kinds of instrumentation: zero-code, using the agent provided by OpenTelemetry, and manual, where we write code but gain fine-grained control to emit business data that powers enterprise decision making. We then looked at distributed tracing in the Java ecosystem, followed by setting up an end-to-end OpenTelemetry observability stack for a Java microservices system using Docker Compose. Having discussed context propagation, we took a deep dive into Mapped Diagnostic Context and wrote a Spring Boot controller to demonstrate a context memory leak in Java. Finally, we used HeapHero to diagnose the issue we created, and learnt how MDC gets attached to threads by analysing them in depth with fastThread.
