With the rise of microservices-based cloud applications and their corresponding complexities, the need for observability is greater than ever. This blog looks into the what and why of distributed tracing, along with a few best practices for adopting it in a microservices architecture.

Distributed tracing for Microservices architecture is an emerging concept that is gaining momentum across internet-based business organizations.

We know that microservices architecture introduced a new way to scale a cloud application by splitting it into several independent services. It facilitates higher resiliency, scalability, productivity, and efficiency than monolithic architectures.

However, this comes with its own complexities, such as difficulty in tracking down bugs or monitoring the traffic flow across the entire infrastructure.

Distributed tracing was introduced to tame these complexities. It helps in debugging issues that span several services and improves visibility within the network. It also supports developers by narrowing down end-to-end latency and errors to the specific service or function that is experiencing them.

This article aims to give you an overall picture of distributed tracing and its implications for microservices architecture.

Distributed Tracing Explained

Observability is the ability to monitor the behavior of infrastructure at a granular level. It provides maximum visibility within the infrastructure and helps the incident management team maintain the reliability of the architecture.

Observability is achieved by recording system data in various forms, such as metrics, alerts (events), logs, and traces. These signals help in deriving insights about the internal health of the infrastructure. Here, we are going to discuss the importance of tracing and how it evolved into a technique called distributed tracing.

Traces

Tracing is the continuous recording of an application’s flow and data progression, often representing a single user’s journey through an app stack. Traces make the behavior and state of an entire system more obvious and comprehensible. Distributed request tracing is an evolution of this method that helps to keep cloud applications in good health.

Distributed tracing is the process of following a transaction request and recording all the relevant data along its path through a microservices architecture. It is used across industries to inspect and visualize traces in a well-structured format. This way of tracing helps SRE/DevOps teams quickly understand and investigate the technical glitches that cause abnormalities within a system infrastructure.

This can be done using tools such as OpenTelemetry (a standardized framework for observability across cloud-native applications), which is considered a vendor-neutral approach to tracing.

Why is there a Need for Distributed Tracing?

Research from 2018 shows that 63% of traditional enterprises are moving their systems to microservices architecture. With this major shift from monolithic to microservices architecture, the need for tracing within heavily distributed systems became more evident. With its granular observability features, distributed tracing drastically reduces the common challenges in monitoring such systems.

Let’s imagine an interactive social gaming platform that has millions of users across the globe in all age groups. When a user sets some preferences on the platform, the system has to process the data with tight latency and deliver the appropriate outcome. Here, distributed tracing plays a vital role by capturing each user's request and following it as it is processed across various microservices, so that the team can verify that results are delivered within a fraction of a second.

Let’s see how distributed tracing helps the gaming infrastructure handle this.

Some of the use cases are:

Provides end-to-end visibility across the infrastructure

In the gaming platform example above, distributed tracing would record details such as the user's location and demographics and store them in the system. It follows a user request and records all the necessary data associated with it. With this functionality, the platform would achieve end-to-end visibility inside its architecture.

Provides information about service dependencies

Services in a microservices environment depend on each other to accomplish a user request. Here, when players update their status, the update is communicated to other players through the central server and various other locality-based nodes within the architecture. So the trace of each service request gives information about the dependent services along its path.
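As a minimal sketch of how dependency information can be read out of a trace, the snippet below walks the parent/child links between spans and records which service called which. The Span shape, the service names, and the service_dependencies helper are all hypothetical and chosen only to mirror the status-update example; real tracing backends derive this view in their own way.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Set


class Span(NamedTuple):
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    service: str                   # hypothetical service name


def service_dependencies(spans: List[Span]) -> Dict[str, Set[str]]:
    """Map each calling service to the set of services it called."""
    service_of = {span.span_id: span.service for span in spans}
    dependencies: Dict[str, Set[str]] = defaultdict(set)
    for span in spans:
        caller = service_of.get(span.parent_span_id) if span.parent_span_id else None
        # Skip root spans and calls that stay inside one service.
        if caller is not None and caller != span.service:
            dependencies[caller].add(span.service)
    return dict(dependencies)


# Hypothetical trace of one player status update.
status_update = [
    Span("s-1", None, "api-gateway"),
    Span("s-2", "s-1", "status-service"),
    Span("s-3", "s-2", "central-server"),
    Span("s-4", "s-3", "node-eu"),
    Span("s-5", "s-3", "node-us"),
]

for caller, callees in service_dependencies(status_update).items():
    print(caller, "->", sorted(callees))
```

Aggregating this mapping over many traces would give a service dependency map for the whole platform.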

Ensures resiliency when the system encounters a failure

Consider an in-app purchase feature in the gaming platform that encounters a failure due to invalid user credentials. With distributed tracing, the developers can inspect the trace of the payment portal's API calls to locate and rectify the failure instead of searching through various logs. Because every transaction is recorded with the necessary network data, this saves quite a lot of time.
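The sketch below illustrates the idea of locating such a failure from a trace rather than from logs. It assumes each span carries an error flag and an optional message; the span names, the fields, and the failure_path helper are hypothetical and only model the in-app purchase example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    name: str
    error: bool = False
    error_message: str = ""


def failure_path(spans: List[Span]) -> List[Span]:
    """Return the spans from the root down to the span where the error started."""
    by_id: Dict[str, Span] = {span.span_id: span for span in spans}
    # A failing span with no failing children is where the error originated.
    failing_parents = {s.parent_span_id for s in spans if s.error}
    origin = next(
        (s for s in spans if s.error and s.span_id not in failing_parents), None
    )
    path: List[Span] = []
    while origin is not None:
        path.append(origin)
        origin = by_id.get(origin.parent_span_id) if origin.parent_span_id else None
    return list(reversed(path))


# Hypothetical trace of a failed in-app purchase.
purchase = [
    Span("s-1", None, "POST /purchase", error=True),
    Span("s-2", "s-1", "cart.validate"),
    Span("s-3", "s-1", "payment.authorize", error=True),
    Span("s-4", "s-3", "auth.check_credentials", error=True,
         error_message="invalid user credentials"),
]

path = failure_path(purchase)
print(" > ".join(span.name for span in path))
print("cause:", path[-1].error_message)
```

The path points straight at the credentials check inside the payment flow, which is the kind of shortcut the trace gives over reading each service's logs separately.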

How Does Distributed Tracing Work?

Before we look into how distributed tracing is performed during a user request, let’s take a look at the basic terminology.

Request: This denotes how various cloud applications, microservices, and other functions communicate with each other.

Span: This describes the work done by a single service, together with its time interval and corresponding metadata. Spans are the basic building blocks of a trace.

Trace: This represents an end-to-end user request, which consists of one or more spans.

Tag: These are the pieces of information (metadata) associated with each span (recorded along the path) that provide a detailed overview of the actions performed during a span.

A single trace contains a series of spans with associated tags.
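To make the terminology concrete, here is a minimal sketch of how spans, tags, and a trace could be modeled. The class and field names are illustrative assumptions, not the data model of any particular tracing tool, and the sign-up spans are made-up sample data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Work done by one service, with its time interval and tags (metadata)."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """An end-to-end request: one or more spans sharing a trace ID."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    def add(self, span: Span) -> None:
        if span.trace_id != self.trace_id:
            raise ValueError("span belongs to a different trace")
        self.spans.append(span)

    def root(self) -> Span:
        return next(s for s in self.spans if s.parent_span_id is None)


# Hypothetical sign-up request on the gaming platform.
trace = Trace(trace_id="t-1")
trace.add(Span("t-1", "s-1", None, "POST /signup", 0, 120, {"http.method": "POST"}))
trace.add(Span("t-1", "s-2", "s-1", "validate-email", 5, 40, {"service": "accounts"}))
trace.add(Span("t-1", "s-3", "s-1", "store-credentials", 45, 110, {"db.host": "db-1"}))

print(trace.root().name, trace.root().duration_ms)
```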

Let's now discuss how Distributed Tracing handles a single request.

The process of distributed tracing starts when the end-user begins interacting with the systems and applications. For example, if a new user signs up for the interactive mobile gaming platform, the user will need to enter an email id and password.
Each user request is sent as an HTTP request and is assigned a unique trace ID (Global ID). Here, the sign-up request carrying the user data would be assigned such an ID.
As the request travels through the host system, every system operation is counted as a span, and its sub-operations are counted as child spans. The first span of a trace is also called the root span. In our example, handling the sign-up request would be the root span, and sub-operations such as validating the email id and storing the password would be child spans.
Every span is tagged with three IDs:
(a) Request Trace ID,
(b) Parent Span ID,
(c) Child Span ID.
Every span of the end-user's request is encoded with information (tags) about how the request was processed. This data includes:
(a) Name and address of the microservice that is handling the user request
(b) Context of the events and logs that are tied to the processes executing the request
(c) Query and filter tags that identify a request by its session ID, database host, HTTP method, and various other key identifiers
(d) Error messages and stack traces recorded when a system encounters a failure while processing the request
All of this data is then attached to the Global ID, which carries the relevant information about the path the trace travels from source to destination.
Finally, all the information about the trace of the user request’s journey is stored in the respective data store. In the case of the gaming platform, the data will be stored in the backend server's database tier for future reference.
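The steps above can be sketched in a few lines of code. This is a simplified illustration, not how any specific tracing library is implemented: the helper names and the accounts service are hypothetical, and the header layout is an assumption borrowed from the W3C Trace Context traceparent format (version, 32-hex trace ID, 16-hex parent span ID, flags), which the article itself does not prescribe.

```python
import secrets
from typing import Dict, NamedTuple, Optional, Tuple


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    operation: str


def inject(context: SpanContext, headers: Dict[str, str]) -> None:
    """Write the current trace context into outgoing HTTP headers."""
    headers["traceparent"] = f"00-{context.trace_id}-{context.span_id}-01"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Read (trace_id, parent_span_id) from incoming headers, if present."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return parts[1], parts[2]


def start_span(headers: Dict[str, str], operation: str) -> SpanContext:
    """Start a span, continuing the caller's trace when one is present."""
    incoming = extract(headers)
    if incoming is None:
        # No context arrived: this is the root span, so mint a new trace ID.
        trace_id, parent_span_id = secrets.token_hex(16), None
    else:
        trace_id, parent_span_id = incoming
    return SpanContext(trace_id, secrets.token_hex(8), parent_span_id, operation)


# The gateway receives the sign-up request without any trace context.
root = start_span({}, "POST /signup")

# The gateway calls a (hypothetical) accounts service and passes the context on.
outgoing: Dict[str, str] = {}
inject(root, outgoing)
child = start_span(outgoing, "accounts.create_user")

print(child.trace_id == root.trace_id, child.parent_span_id == root.span_id)
```

Because the trace ID travels with every outgoing call, spans recorded by different services can later be stitched back into one trace.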

There are separate tools for performing distributed tracing across the architecture, and these fall into three categories.

Types of Distributed Tracing Tools

Code Tracing Tools: Perform tracing during the execution of a computer program (code). These tools help in tracing every line of code, the variables declared, the conditional statements used, and the iterative functions, through to the final output of the code. They are of great help in code analysis and diagnosis. Examples of code tracing tools include OpenTracing, OpenZipkin, and Appdash.
Data Tracing Tools: Perform tracing while validating critical data elements (CDE) or telemetry data against the source system and monitoring them with statistical process control (SPC) methods. Examples of data tracing tools include Datadog, Jaeger, New Relic, Dynatrace, and Lightstep.
Program (Process) Tracing (ptrace) Tools: Perform tracing during the execution of an application. They record the sequence of instructions executed and the data referenced during execution. These are widely used by developers for debugging purposes. Examples of ptrace tools include strace, ltrace, opensnoop, and Valgrind Lackey.

Additional Reading: Top Observability tools for DevOps Engineers and SREs

How To Get Started With Distributed Tracing for Your Infrastructure?

Listed below are a few links that can be helpful in getting started with distributed tracing within a microservices architecture.

To implement distributed tracing across your architecture, follow the steps outlined here: OpenTelemetry (OpenTracing + OpenCensus)
Organizations that have Jaeger running natively across Docker can follow the steps mentioned in the Jaeger documentation
If you have configured your infrastructure with Java or Docker, follow these steps for applying OpenZipkin across your infrastructure
To apply a distributed tracing pattern to your architecture, refer to: Distributed tracing pattern
To implement distributed tracing across your microservices-based web application, see: IBM Garage methodology
To track a system request along the network path and understand why systems don’t work as expected, check here: Distributed tracing Guide
To understand your microservices architecture and its behavior with distributed tracing, check here: Understanding microservices with distributed tracing

So, by following the resources above, a distributed tracing system can be implemented across any microservices architecture.

Now, with the increased adoption of distributed tracing come practical challenges. To stay reliable, we should follow best practices while implementing it.

Best Practices while Adopting Distributed Tracing in Microservices Architecture

Implement end-to-end instrumentation and record traces for all of your inbound and outbound service calls
Focus on the SRE golden signals (latency, traffic, errors, and saturation or utilization) along with the RED metrics (Rate, Errors, and Duration), and set up alerts on them while recording all the system traces. Take note of the duration metrics to study system behavior
Adhere to the OpenTelemetry (OpenTracing + OpenCensus) standard and make sure your tools are compliant with it
Document all the customized business metrics and the tracing spans for future reference
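As an illustration of the metrics practice above, the sketch below derives a request rate, an error ratio, and a duration percentile from recorded spans and checks the duration against an alert threshold. The SpanRecord shape, the 10-second window, the sample data, and the 150 ms threshold are all assumptions made for the example; in practice these numbers would come from your tracing backend and your own service-level targets.

```python
from typing import Dict, List, NamedTuple


class SpanRecord(NamedTuple):
    duration_ms: float
    error: bool


def red_metrics(spans: List[SpanRecord], window_seconds: float) -> Dict[str, float]:
    """Compute request rate, error ratio, and p95 duration for one time window."""
    count = len(spans)
    if count == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(span.duration_ms for span in spans)
    # Nearest-rank 95th percentile (1-based rank), using integer arithmetic.
    rank = (95 * count + 99) // 100
    errors = sum(1 for span in spans if span.error)
    return {
        "rate_per_s": count / window_seconds,
        "error_ratio": errors / count,
        "p95_ms": durations[rank - 1],
    }


# Hypothetical data: 20 root spans seen in a 10 s window,
# with durations of 10..200 ms, where only the slowest request failed.
window = [SpanRecord(duration_ms=10.0 * i, error=(i == 20)) for i in range(1, 21)]
metrics = red_metrics(window, window_seconds=10.0)

P95_ALERT_MS = 150.0  # assumed alert threshold
should_alert = metrics["p95_ms"] > P95_ALERT_MS

print(metrics, should_alert)
```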

Additional Reading: Kubernetes Operators for Automated SRE

Conclusion

Distributed tracing is an efficient technique for monitoring microservices architecture. It gives more precise data and information about the path a request takes through the network. By adopting standardized distributed tracing tools along with end-to-end instrumentation and the SRE golden signals, we can work through the challenges of implementing it.

Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

Using Distributed Tracing in Microservices Architecture