From time to time, my home is disconnected from the Internet, typically due to an outage at the Internet Service Provider. I do receive notifications about my cameras going offline, but I have no idea how long the outage lasts; it can range from 10 minutes to 12 hours, and Comcast won’t tell you (you don’t even have access to the outage information unless you log in with a Comcast account). In addition, sometimes the Internet isn’t fully down, but the quality drops, so I’d like to set up something to monitor my Internet connection.
Journey of solution-hunting
The principle is simple: I set up something to ping (or cURL) https://google.com. Given the historical availability of that page, this probe will tell me whether I can reach the Internet. Now the real question is: what tool can run the queries, save the data, and visualize the latency for me?
I did some research and asked around. Most people don’t care; the ones who care typically use SmokePing:
SmokePing is a latency logging and graphing and alerting system. It consists of a daemon process which organizes the latency measurements and a CGI which presents the graphs.
SmokePing has its own web UI to show the data. The UI looks good enough for most users, but I’m not an average user: it is insufficient for an SRE-SWE. I’m more of a backend guy, so I’m not very good at implementing the interactive features myself. Another option I found was Nagios:
Nagios Core, formerly known as Nagios, is a free and open-source computer-software application that monitors systems, networks and infrastructure. Nagios offers monitoring and alerting services for servers, switches, applications and services. It alerts users when things go wrong and alerts them a second time when the problem has been resolved.
I tried it briefly before I had another idea: simply make a daemon (or a scheduled task) to post to some data-hosting service, such as Google Cloud Monitoring, also known as “Stackdriver”. (Disclaimer: I’m a Googler.) I did some more research, and summarized my options in a GitHub issue:
Cloudprober is a monitoring software that makes it super-easy to monitor availability and performance of various components of your system. Cloudprober employs the "active" monitoring model. It runs probes against (or on) your components to verify that they are working as expected. For example, it can run a probe to verify that your frontends can reach your backends. Similarly it can run a probe to verify that your in-Cloud VMs can actually reach your on-premise systems. This kind of monitoring makes it possible to monitor your systems' interfaces regardless of the implementation and helps you quickly pin down what's broken in your system.
Cloudprober was created by Googlers (probably as a side project), and it supports uploading data to Google Cloud Monitoring (exactly what I needed). The tool was designed for black-box monitoring of services you own, but it can actually be pointed at any service. For example, I monitor the Google homepage with this configuration:
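A minimal sketch of such a config (the field names follow Cloudprober’s documented config schema; the probe name, intervals, and bucket parameters here are illustrative, not necessarily the exact values I run):

```textproto
# Probe the Google homepage over HTTPS.
probe {
  name: "google_homepage"
  type: HTTP
  targets {
    host_names: "www.google.com"
  }
  interval_msec: 15000  # probe every 15 seconds
  timeout_msec: 5000
  http_probe {
    protocol: HTTPS
  }
  # Aggregate latency into exponential buckets [k*a^i, k*a^(i+1)).
  latency_distribution {
    exponential_buckets {
      scale_factor: 0.001
      base: 2
      num_buckets: 20
    }
  }
}

# Upload the results to Google Cloud Monitoring ("Stackdriver").
surfacer {
  type: STACKDRIVER
}
```

The base and scale factor in the sketch are placeholders; how to pick them is discussed in the bucketing strategy below.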
This creates nice charts on the Google Cloud console.
Starting Cloudprober yourself
First, of course, you need to download Cloudprober. You can follow the official guide to download the pre-built binary, or (if you are using Arch Linux) use my AUR package.
Then you need a Google Cloud project and a service account key. Follow the Google Cloud documentation to set the GOOGLE_APPLICATION_CREDENTIALS environment variable.
Now save the configuration file somewhere, for example as ~/Desktop/cloudprober.textproto, and run Cloudprober like this:
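For example (a sketch: the key path is a placeholder, and --config_file is the flag Cloudprober reads its configuration from):

```shell
# Point the Google Cloud client libraries at the service account key
# created earlier (the path below is a placeholder).
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/cloudprober-sa-key.json"

# Start Cloudprober with the configuration file saved above.
cloudprober --config_file "$HOME/Desktop/cloudprober.textproto"
```

Cloudprober runs as a foreground daemon, so you may want to wrap this in a systemd unit (or similar) for unattended operation.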
The configuration file is specified in the text format of Protocol Buffers, also known (in Google) as “Text-Proto”. Most parts are straightforward; the latency_distribution stanza is explained in the Google Cloud documentation. If we denote the scale factor as “k” and the base as “a”, then basically the buckets are the right-open intervals
[k·a^i, k·a^(i+1))
except the first and the last bucket (which have to cover the minimum and the maximum, respectively). My strategy for choosing the base and the scale factor boils down to the “target interval”, which is the interval of latency that I care about. If we denote the number of buckets as “n”, then my strategy is that the entire “target interval” should be covered by the buckets from the second to the second-to-last. In other words, the “target interval” is supposed to be a subset of