This is an introduction to nodetop, a tool I created for ad-hoc performance troubleshooting in environments that have node_exporter installed to provide statistics.
Node_exporter is normally used by Prometheus to scrape operating system statistics, and quite often those are turned into dashboards and graphs using Grafana. But the Grafana dashboards might not be reachable, or might not provide the exact details you want. And querying Prometheus directly is "spartan": if you have tried that, you know what I mean.
How about a utility that simply uses the node_exporter endpoints and provides the details you need to drill down into CPU usage or disk IO, so you can use it pretty much like sar, iostat or dstat, but for a number of servers at the same time? That is exactly what nodetop does. But it can do more than that: it can also draw a diagram of the CPU and disk statistics over the period that nodetop was measuring the endpoints, with the granularity of the configured measuring interval.
Because CPU and disk IO are crucial to PostgreSQL, as well as to YugabyteDB or any other database for that matter, this can be very helpful.
How does it work? Nodetop is written in Rust, so currently you have to set up a Rust environment and then build the executable in release mode with cargo build --release, which is described in the readme of the project.
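As a sketch, assuming a working Rust toolchain is installed, the build boils down to something like this (the actual repository URL is in the project readme):

git clone <nodetop repository URL>
cd nodetop
cargo build --release
# the binary is then available as ./target/release/nodetop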
Once nodetop is built, you simply specify whether you want to see CPU details (-c) or disk details (-d), specify the hosts (-h) as a comma-separated list, and specify the node_exporter port (-p).
This is what the CPU overview of my freshly installed 3-node cluster looks like:
[yugabyte@ip-172-162-49-163 nodetop]$ ./target/release/nodetop -c -h 172.162.21.186,172.162.45.207,172.162.60.17 -p 9300
hostname r b | id us sy io ni ir si st | gu gn | scd_rt scd_wt | in cs | l_1 l_5 l_15
172.162.21.186:9300:metrics 5 0 | 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.000 0.000 | 0 0 | 0.000 0.040 0.050
172.162.45.207:9300:metrics 8 0 | 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.000 0.000 | 0 0 | 0.000 0.010 0.050
172.162.60.17:9300:metrics 6 0 | 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.000 0.000 | 0 0 | 0.020 0.020 0.050
172.162.21.186:9300:metrics 6 0 | 3.953 0.018 0.016 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.044 0.005 | 3449 3742 | 0.000 0.030 0.050
172.162.45.207:9300:metrics 7 0 | 3.945 0.018 0.018 0.002 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.050 0.005 | 3418 3659 | 0.000 0.010 0.050
172.162.60.17:9300:metrics 7 0 | 3.948 0.024 0.016 0.002 0.000 0.000 0.002 0.000 | 0.000 0.000 | 0.050 0.004 | 3455 3678 | 0.020 0.020 0.050
172.162.21.186:9300:metrics 6 0 | 3.940 0.018 0.016 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.054 0.004 | 4037 4381 | 0.000 0.030 0.050
172.162.45.207:9300:metrics 4 0 | 3.955 0.018 0.012 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.041 0.002 | 3143 3380 | 0.000 0.010 0.050
172.162.60.17:9300:metrics 6 0 | 3.942 0.018 0.014 0.000 0.000 0.000 0.002 0.000 | 0.000 0.000 | 0.051 0.004 | 3579 3809 | 0.020 0.020 0.050
172.162.21.186:9300:metrics 7 0 | 3.955 0.020 0.016 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.040 0.004 | 3474 3755 | 0.000 0.030 0.050
172.162.45.207:9300:metrics 7 0 | 3.952 0.018 0.020 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.047 0.005 | 3114 3297 | 0.000 0.010 0.050
172.162.60.17:9300:metrics 6 0 | 3.954 0.020 0.018 0.000 0.000 0.000 0.000 0.000 | 0.000 0.000 | 0.043 0.004 | 2849 3037 | 0.020 0.020 0.050
It's mostly idle: the nodes are EC2 machines of type c5.xlarge, which have 4 vCPUs, and 3.9 seconds of idle time per second means the CPUs are doing hardly any work.
I added a set of statistics that is not shown by traditional Unix, and thus Linux, utilities: the scheduler runtime and scheduler waiting-for-runtime figures (scd_rt and scd_wt). These represent the total time the scheduler considers tasks to be running, and the total time all tasks spend waiting to get runtime, in other words the total scheduling delay.
This is partially, "sort of", visible in 'r' (the total number of running tasks) and in the load figures, but those do not give a comprehensive view. Scheduler runtime and scheduler waiting for runtime do.
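Under the hood these figures come from the kernel scheduler statistics that node_exporter exposes per CPU. As an illustration (a sketch, assuming node_exporter's schedstat collector is enabled, which it is by default), you can look at the raw counters that nodetop diffs over each interval:

curl -s http://172.162.21.186:9300/metrics | grep node_schedstat
# node_schedstat_running_seconds_total{cpu="0"} : cumulative time tasks were running on this CPU
# node_schedstat_waiting_seconds_total{cpu="0"} : cumulative time tasks waited for runtime on this CPU

Summing these counters over all CPUs and dividing the delta between two samples by the interval length would yield the seconds of runtime and scheduling delay per second shown above.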
This is how disk figures are shown:
[yugabyte@ip-172-162-49-163 nodetop]$ ./target/release/nodetop -d -h 172.162.21.186,172.162.45.207,172.162.60.17 -p 9300
reads per second | writes per second | discards per second | | totals per second
hostname merge io mb avg | merge io mb avg | merge io sect avg | queue | IOPS MBPS
172.162.21.186:9300:metrics nvme0n1 0 0 0 0.000000 | 0 0 0 0.000000 | 0 0 0 0.000000 | 0.000 | 0 0
172.162.21.186:9300:metrics nvme1n1 0 0 0 0.000000 | 0 0 0 0.000000 | 0 0 0 0.000000 | 0.000 | 0 0
172.162.45.207:9300:metrics nvme0n1 0 0 0 0.000000 | 0 0 0 0.000000 | 0 0 0 0.000000 | 0.000 | 0 0
172.162.45.207:9300:metrics nvme1n1 0 0 0 0.000000 | 0 0 0 0.000000 | 0 0 0 0.000000 | 0.000 | 0 0
172.162.60.17:9300:metrics nvme0n1 0 0 0 0.000000 | 0 0 0 0.000000 | 0 0 0 0.000000 | 0.000 | 0 0
172.162.60.17:9300:metrics nvme1n1 0 0 0 0.000000 | 0 0 0 0.000000 | 0 0 0 0.000000 | 0.000 | 0 0
172.162.21.186:9300:metrics nvme0n1 0 0 0 0.000000 | 0 1 0 0.000750 | 0 0 0 0.000000 | 0.001 | 1 0
172.162.21.186:9300:metrics nvme1n1 0 0 0 0.000000 | 0 5 0 0.000435 | 0 0 0 0.000000 | 0.002 | 5 0
172.162.45.207:9300:metrics nvme0n1 0 0 0 0.000000 | 0 1 0 0.000500 | 0 0 0 0.000000 | 0.000 | 1 0
172.162.45.207:9300:metrics nvme1n1 0 0 0 0.000000 | 0 4 0 0.000476 | 0 0 0 0.000000 | 0.002 | 4 0
172.162.60.17:9300:metrics nvme0n1 0 0 0 0.000000 | 0 0 0 0.000000 | 0 0 0 0.000000 | 0.000 | 0 0
172.162.60.17:9300:metrics nvme1n1 0 0 0 0.000000 | 0 5 0 0.000625 | 0 0 0 0.000000 | 0.003 | 5 0
My cluster was idle at the time of measurement, so there is not a lot to see, but the point is what can be seen. For reads, writes and discards (discards are a recent phenomenon) it shows the number of merges, IOs and MB per second, and the average latency (the delta of the time spent divided by the total number of IOs). It also shows the average disk queue size, and a clear overview of the total number of IOPS and MBPS. These figures are often used in clouds and on premises, yet are rarely stated this clearly.
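To make the latency calculation concrete, here is a sketch (assuming the standard diskstats collector metric names in node_exporter) of the raw counters behind, for example, the write latency column:

curl -s http://172.162.21.186:9300/metrics | grep -E 'node_disk_(writes_completed|write_time_seconds)_total'
# average write latency over an interval =
#   delta(node_disk_write_time_seconds_total) / delta(node_disk_writes_completed_total)

The same pattern applies to reads and discards, using their respective completed and time counters.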
Another thing nodetop can do is draw a graphic of the CPU and disk statistics that were obtained. Regardless of whether CPU or disk mode was specified when invoked, if the --graph switch is specified, nodetop creates a CPU and a disk graphic in the current working directory.
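For example, the CPU invocation from above only needs the extra switch to also produce the graphics:

./target/release/nodetop -c -h 172.162.21.186,172.162.45.207,172.162.60.17 -p 9300 --graph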
This is what that looks like for CPU:
And this is what that looks like for disk:
For users of YugabyteDB, there is an additional gather mode: the YugabyteDB mode. In this mode nodetop gathers statistics from the YugabyteDB servers and shows their IO-related statistics; YugabyteDB servers provide statistics in prometheus format too:
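As an illustration (an assumption about the default YugabyteDB ports, not something nodetop requires you to do), the prometheus-format statistics of a YugabyteDB tablet server can be inspected directly, just like the node_exporter endpoint:

curl -s http://172.162.21.186:9000/prometheus-metrics | head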