Why I Developed Kyanos

Have you ever faced a problem like this?

You just joined a new company and are responsible for a backend service. Everything is going smoothly...

At 5 PM on a Friday, with just an hour left before you clock out, trouble hits: your upstream is angrily asking 😡 why your interface is timing out.

You panic 😩 but try to stay calm and check the monitoring, only to find that your service's interface latency is normal.

Just as you’re about to argue back, you suddenly realize that the company's monitoring only tracks server-side application latency, with no visibility into kernel and network latency!

As a result, neither side can convince the other 👿. A blame game follows, and the issue remains unresolved...

Conversely, if you experience a timeout with downstream interfaces but their monitoring shows no issues, a new blame game begins, only this time you’re on the other side...

So, how can we solve this problem?

Use tcpdump for packet capture! However, the biggest drawback of tcpdump is that troubleshooting with it is too slow.

For example, if you need to troubleshoot an issue in production:

a. First, you need to install tcpdump in the production environment 💤,

b. Then find the IP and port of the problematic server 👁👁

c. Write a filter expression yourself (which you might have forgotten how to write, spending 5 minutes searching~😂)

d. After much effort, you finally download a several hundred MB pcap file from the production machine 💤

e. Install a pcap analysis tool such as Wireshark on your own machine (if you haven’t already) 💤

f. Load the pcap file and watch your CPU fan spin wildly 😡

g. After a few minutes, you squint 👁👁 to manually verify whether the interface you want is in this pcap file, only to find that it isn’t there, lost among all the irrelevant traffic 🤡...

h. Then you start capturing packets again, checking whether your tcpdump command was correct... 😔

Ask the operations team to install monitoring tools! Fortunately, there’s a powerful technology called eBPF that allows for deep kernel-level data collection and full-stack observability. Modern tools like SkyWalking, Pixie, and DeepFlow are impressive products.

But this isn’t easy! These monitoring tools require a high kernel version (5.x), monitor only K8s traffic, or generate TBs of monitoring data per hour and therefore depend on heavy storage.

So, is there a tool for troubleshooting network issues that is lightweight, highly efficient, and compatible with older kernel versions?

Kyanos is here!

What is Kyanos

Kyanos is an open-source command-line tool 👉 Kyanos repository 👈 that supports kernel versions as old as 3.10 and runs without any additional dependencies. All you need to do is download the executable file (Release download link).

So, what can Kyanos do?

Before diving into the details, let’s see an example of what Kyanos can do.

You don’t need to understand any filter syntax. Just run a single command (kyanos stat http ...), and Kyanos will find the slowest HTTP requests and provide details on their latency (imagine how much time it would take with tcpdump):

Here, one command finds several of the slowest requests.

If you want to print the content of the request and response, you can do it like this:

As you can see, the request and response content is printed directly.

Kyanos is not only easy to install, but also aligns perfectly with our troubleshooting needs. Unlike packet-based tools, which are too granular and often contain a lot of irrelevant information, Kyanos is completely non-intrusive and works at the application layer, filtering out unnecessary data and retaining only the most valuable information for troubleshooting.

So, what exactly can Kyanos do? The main features of Kyanos are:

Capturing request and response data for various protocols (HTTP, MySQL, Redis, etc.).
Performing higher-dimensional analysis of the captured traffic through aggregation.

I won’t go into too much detail here—let’s dive straight into examples! 🤞

Detailed Analysis - `watch`

The watch command allows you to capture request and response data for various protocols (HTTP, MySQL, Redis, etc.) using a range of filtering conditions. You don't need to understand any filtering expressions to easily capture and analyze the request and response data you need.

For example, if you have a Spring Cloud application and monitoring alerts show that accessing a remote interface /foo/bar occasionally has some p99 spikes (e.g., requests taking more than 1 second), indicating some long-tail requests, how would you identify the root cause of the issue?

It's simple: use the watch command in Kyanos to see where the requests taking more than 1 second are spending their time:

kyanos watch http --pid {your_pid} --latency 1000 --path /foo/bar

It will output results similar to this:

You can see that watch outputs several parts:

The content of the request and response (note: if it is too long and exceeds 1024 bytes, it will be truncated).
The total duration, i.e., the time from when the request was initiated to when all responses are received.
Kernel and external durations: including time spent reading the response from the socket buffer and network time (the time from when the request reaches the network card to when the response is completely received by the network card).
System call details: including the number of read and write system calls and the amount of data read and written.

Kyanos not only supports capturing request and response content but also calculates network and kernel durations, which is extremely helpful for troubleshooting!
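
To make the relationship between these durations concrete, here is a tiny sketch. It only illustrates the definitions listed above; it is not how Kyanos computes them internally, and the `breakdown` function and all four timestamps are made up for the example.

```shell
#!/bin/sh
# Illustrative only: the four timestamps are hypothetical, in milliseconds.
# Usage: breakdown REQ_START REQ_AT_NIC RESP_AT_NIC RESP_READ
breakdown() {
  total=$(( $4 - $1 ))        # request initiated -> response fully read by the application
  network=$(( $3 - $2 ))      # request reaches the network card -> response fully received by it
  socket_read=$(( $4 - $3 ))  # response sits in the socket buffer until the application reads it
  echo "total=${total}ms network=${network}ms socket_read=${socket_read}ms"
}

breakdown 0 2 1040 1050
# prints: total=1050ms network=1038ms socket_read=10ms
```

In this made-up request, 1038 of the 1050 ms pass between the request leaving the network card and the response arriving back at it, so the time is going to the network or the remote side rather than to the local process.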

Currently, watch supports capturing traffic for HTTP, MySQL, and Redis protocols (though this is not exhaustive; more protocols will be supported in the future) and supports various filtering conditions. For more details, see the GitHub documentation: Kyanos Command Details - Watch Section

Overview - `stat`

While the watch command offers a granular analysis perspective, stat provides more flexible and higher-dimensional analytical capabilities. In simple terms, it aggregates request and response metrics by certain dimensions. For example, if you want to know which remote IPs have the slowest HTTP interfaces, you can aggregate requests and responses by the same IP to find the slowest remote IP.
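
If "aggregating by a dimension" sounds abstract, the idea can be sketched with plain awk. This is only a toy illustration of grouping by remote IP, not Kyanos's implementation; the `aggregate_by_ip` function and the sample records are invented for the example.

```shell
#!/bin/sh
# Hypothetical records: "<remote-ip> <latency-ms>", one request per line.
records='10.0.0.1 12
10.0.0.2 950
10.0.0.1 18
10.0.0.2 1050
10.0.0.3 40'

# Group by remote IP and print "ip avg max count", slowest average first.
aggregate_by_ip() {
  LC_ALL=C awk '{ n[$1]++; sum[$1] += $2; if ($2 > max[$1]) max[$1] = $2 }
    END { for (ip in n) printf "%s %.1f %d %d\n", ip, sum[ip] / n[ip], max[ip], n[ip] }' |
  LC_ALL=C sort -k2,2nr
}

printf '%s\n' "$records" | aggregate_by_ip
# prints:
# 10.0.0.2 1000.0 1050 2
# 10.0.0.3 40.0 40 1
# 10.0.0.1 15.0 18 2
```

Once the individual requests are collapsed into one row per remote IP, the slow peer (10.0.0.2 here) stands out immediately.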

So it can answer questions like:

Which of the remote IPs this machine talks to have the slowest HTTP interfaces?

One command does it all: ./kyanos stat http --side client -i 5 -m n -l 10 -g conn. This command outputs the top 10 HTTP connections with the longest network latency every 5 seconds. The output looks like this:

The result shows two connections along with latency information such as avg, max, and pxx.
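
A quick note on the pxx columns: a percentile such as p99 is the latency that 99% of requests stay at or below. The sketch below uses the common nearest-rank definition; I am assuming that definition purely for illustration, and Kyanos may compute its percentiles differently. The `percentile` function and the sample latencies are hypothetical.

```shell
#!/bin/sh
# Nearest-rank percentile: reads one number per line on stdin.
# Usage: percentile P   (P is an integer between 1 and 100)
percentile() {
  LC_ALL=C sort -n | awk -v p="$1" '{ v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Ten hypothetical latencies in ms; one long-tail request hides among them.
samples='12 15 11 14 13 1200 16 12 13 14'

printf '%s\n' $samples | percentile 50   # prints 13
printf '%s\n' $samples | percentile 99   # prints 1200
```

The median looks perfectly healthy at 13 ms, while p99 exposes the single 1200 ms request. That is why the pxx columns matter when you are hunting long-tail requests.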

My machine has very high outbound traffic. Which HTTP requests are causing this?

One command does it all: ./kyanos stat http --side client -i 5 -m p -s 10 -g none. Every 5 seconds, this command outputs the 10 HTTP requests with the largest response sizes. The output looks like this:

The results include average, maximum, and pxx values of response sizes, as well as the HTTP request path information.

General Steps for Using the stat Command

First, determine the metrics you're interested in and specify them using --metrics (or -m, as in the examples above). Kyanos supports aggregation for the following metrics:

Metric	Flag
Total Latency	t
Response Size	p
Request Size	q
Network Latency	n
Socket Buffer Read Latency	s

Next, specify the aggregation dimension using --group-by or -g. For example, if you're interested in whether different remote services provide different service quality, you can specify -g remote-ip. This will aggregate the request and response statistics by different remote IP addresses, making it easier to identify which remote service is causing issues. Kyanos supports the following aggregation dimensions:

Aggregation Dimension	Value
Finest granularity, aggregates to individual connections	conn
Remote IP	remote-ip
Remote Port	remote-port
Local Port	local-port
Protocol	protocol
Coarsest granularity, aggregates all request and response data	none
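
Putting the two tables together, picking a metric and a dimension might look like the sketch below. `build_stat_cmd` is a hypothetical helper, not part of Kyanos: it only checks its arguments against the values listed in the tables above and prints a command line. The remaining flags (--side client, -i 5, -l 10) are copied from the earlier examples, and I am assuming they combine with other metric and group-by values in the same way.

```shell
#!/bin/sh
# Hypothetical helper: validates the metric flag and group-by value against
# the two tables above, then prints (does not run) a stat command line.
# Usage: build_stat_cmd METRIC GROUP_BY
build_stat_cmd() {
  case "$1" in
    t|p|q|n|s) ;;
    *) echo "unknown metric flag: $1" >&2; return 1 ;;
  esac
  case "$2" in
    conn|remote-ip|remote-port|local-port|protocol|none) ;;
    *) echo "unknown group-by value: $2" >&2; return 1 ;;
  esac
  echo "./kyanos stat http --side client -i 5 -m $1 -l 10 -g $2"
}

# Total latency, aggregated per remote IP.
build_stat_cmd t remote-ip
# prints: ./kyanos stat http --side client -i 5 -m t -l 10 -g remote-ip
```

Check the printed command against the GitHub documentation before running it on a production machine.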

For a more detailed usage guide, see the GitHub page: Kyanos GitHub

The stat command aggregates the results observed by watch along the user-specified dimension (--group-by), and then outputs them according to the metric the user cares about most (--metrics).

Using tcpdump would take so much time—with Kyanos, it’s done in seconds!

Conclusion

That said, Kyanos can't completely replace tcpdump just yet. For instance, it currently supports a limited set of application-layer protocols, and its support for more complex network environments still needs improvement. However, it can certainly make troubleshooting during development more efficient. 👍

Why am I so confident? Because I have truly solved problems using Kyanos. If you've read my articles, you might know that I work with Redis. Kyanos helped me resolve a very peculiar issue where the Redis client was timing out but the Redis server showed no anomalies (I will be publishing an article about this process soon, so stay tuned if you're interested). I was able to diagnose the issue within 30 minutes, and Kyanos proved its worth. That's why I am confident in open-sourcing it, and I believe Kyanos can help others too!

Finally, could you please give a star to encourage me? 👉 Kyanos

Stop using tcpdump for packet capture! Kyanos helps you troubleshoot network issues in seconds.
