From time to time, my home is disconnected from the Internet, typically due to an outage at the Internet Service Provider. I do receive notifications about my cameras going offline, but I have no idea how long the outage lasts; it can range from 10 minutes to 12 hours, and Comcast won’t tell you (you don’t even have access to the outage information unless you log in with a Comcast account). In addition, sometimes the Internet isn’t fully down, but the quality drops, so I’d like to set up something to monitor my Internet connection.
Journey of solution-hunting
The principle is simple: I set up something to ping (or cURL) https://google.com. Given the historical availability of that page, this probe will tell me whether I can reach the Internet. Now the real question is: what tool can run the queries, save the data, and visualize the latency for me?
I did some research and asked around. Most people don’t care; the ones who care typically use SmokePing:
SmokePing is a latency logging and graphing and alerting system. It consists of a daemon process which organizes the latency measurements and a CGI which presents the graphs.
SmokePing has its own web UI to show the data. The UI looks good enough for most users, but I’m not an average user: it is insufficient for an SRE-SWE. I’m more of a backend guy, so I’m not very good at implementing the interactive features myself. Another option I found was Nagios:
Nagios Core, formerly known as Nagios, is a free and open-source computer-software application that monitors systems, networks and infrastructure. Nagios offers monitoring and alerting services for servers, switches, applications and services. It alerts users when things go wrong and alerts them a second time when the problem has been resolved.
I tried it briefly before I had another idea: simply make a daemon (or a scheduled task) to post to some data-hosting service, such as Google Cloud Monitoring, also known as “Stackdriver”. (Disclaimer: I’m a Googler.) I did some more research, and summarized my options in a GitHub issue:
Cloudprober is a monitoring software that makes it super-easy to monitor availability and performance of various components of your system. Cloudprober employs the "active" monitoring model. It runs probes against (or on) your components to verify that they are working as expected. For example, it can run a probe to verify that your frontends can reach your backends. Similarly it can run a probe to verify that your in-Cloud VMs can actually reach your on-premise systems. This kind of monitoring makes it possible to monitor your systems' interfaces regardless of the implementation and helps you quickly pin down what's broken in your system.
Cloudprober was created by Googlers (probably as a side project), and it supports uploading data to Google Cloud Monitoring (exactly what I needed). The tool was designed for black-box monitoring of services you own, but it can actually be pointed at any service. For example, I monitor the Google homepage with this configuration:
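A minimal sketch of such a config (the field names follow Cloudprober’s documented config schema; the probe name, intervals, and bucket parameters here are illustrative, not necessarily the exact values I run):

```textproto
# Probe the Google homepage over HTTPS.
probe {
  name: "google_homepage"
  type: HTTP
  targets {
    host_names: "www.google.com"
  }
  interval_msec: 15000  # probe every 15 seconds
  timeout_msec: 5000
  http_probe {
    protocol: HTTPS
  }
  # Aggregate latency into exponential buckets [k*a^i, k*a^(i+1)).
  latency_distribution {
    exponential_buckets {
      scale_factor: 0.001
      base: 2
      num_buckets: 20
    }
  }
}

# Upload the results to Google Cloud Monitoring ("Stackdriver").
surfacer {
  type: STACKDRIVER
}
```

The base and scale factor in the sketch are placeholders; how to pick them is discussed in the bucketing strategy below.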
This creates nice charts on the Google Cloud console.
Starting Cloudprober yourself
First, of course, you need to download Cloudprober. You can follow the official guide to download the pre-built binary, or (if you are using Arch Linux) use my AUR package.
Then you need a Google Cloud project and a service account key. Follow the Google Cloud documentation to set the GOOGLE_APPLICATION_CREDENTIALS environment variable.
Now save the configuration file somewhere, for example as ~/Desktop/cloudprober.textproto, and run Cloudprober like this:
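For example (a sketch: the key path is a placeholder, and --config_file is the flag Cloudprober reads its configuration from):

```shell
# Point the Google Cloud client libraries at the service account key
# created earlier (the path below is a placeholder).
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/cloudprober-sa-key.json"

# Start Cloudprober with the configuration file saved above.
cloudprober --config_file "$HOME/Desktop/cloudprober.textproto"
```

Cloudprober runs as a foreground daemon, so you may want to wrap this in a systemd unit (or similar) for unattended operation.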
The configuration file is specified in the text format of Protocol Buffers, also known (in Google) as “Text-Proto”. Most parts are straightforward; the latency_distribution stanza is explained in the Google Cloud documentation. If we denote the scale factor as “k” and the base as “a”, then basically the buckets are the right-open intervals
[k·a^i, k·a^(i+1))
except the first and the last bucket (which have to cover the minimum and the maximum, respectively). My strategy for choosing the base and the scale factor boils down to the “target interval”, which is the interval of latency that I care about. If we denote the number of buckets as “n”, then my strategy is that the entire “target interval” should be covered by the buckets from the second to the second-to-last. In other words, the “target interval” is supposed to be a subset of