Unlock 1 Billion NYC Taxi Rides: A Step-by-Step Guide

Emily Johnson - Oct 4 - - Dev Community

Unleashing Insights: Visualize NYC Taxi Data with Timescale and Grafana

Grafana, an open-source analytics and monitoring platform, is often used to visualize time-series data. In this tutorial, you will learn how to configure Grafana and Timescale Cloud, leverage Grafana to visualize metrics stored in TimescaleDB, and visualize geospatial data using Grafana.

Configuring Grafana and Timescale Cloud

To get started, set up Timescale Cloud. If you prefer to run your own instance of TimescaleDB, follow the installation instructions, and the remainder of the tutorial should be relatively straightforward to follow.

Once you've set up Timescale Cloud, you'll have a functional version of TimescaleDB with preloaded data. In this case, we'll use the New York City taxicab data found in the "Hello, Timescale!" tutorial. For background information on using TimescaleDB, be sure to follow the full tutorial.

If you're using Timescale Cloud, you can set up a Grafana Metrics Dashboard from the Create Service flow. Alternatively, you can set up Grafana Cloud and follow the rest of the instructions below. Note that Grafana Cloud is more feature-rich than the open-source version of Grafana included with Timescale Cloud, but requires a paid subscription from Grafana.

Finally, configure Grafana to connect to your Timescale Cloud instance (or your own installation of TimescaleDB). Begin by selecting 'Add Data Source' and choosing the 'PostgreSQL' option in the SQL group. In the configuration screen, provide the Host, Database, User, and Password for your Timescale Cloud instance (or TimescaleDB server). If you're a Timescale Cloud user, you can find this information in the Service Dashboard for your Timescale Cloud instance.

For more information on unleashing insights and visualizing NYC taxi data with Timescale and Grafana, check out this tutorial on t8tech.com.

When connecting to a TimescaleDB instance in Timescale Cloud for this tutorial, it is crucial to select the 'TimescaleDB' option in the 'PostgreSQL details' section of the PostgreSQL configuration screen.

We will also rename the database to NYC Taxi Cab Data, which is optional but helps others who use our Grafana dashboard understand the nature of this data source.

Once complete, click 'Save & Test'. You should receive confirmation that your database connection is functioning correctly.

Test your Grafana database connection

Designing a Grafana Dashboard and Panel

Grafana organizes information into 'Dashboards' and 'Panels', where a dashboard represents a view into the performance of a system, and each dashboard consists of one or more panels that convey details about a specific metric related to that system.

To begin, let's create a new dashboard. In the far left of the Grafana user interface, you'll find a '+' icon. Hover over it to reveal a 'Create' menu, containing a 'Dashboard' option. Select that 'Dashboard' option.

After creating a new dashboard, you'll see a 'New Panel' screen, with options for 'Add Query' and 'Choose Visualization'. If you already have a dashboard with panels, you can add a new panel by clicking on the '+' icon at the top of the Grafana user interface.

To continue with our tutorial, let's add a new visualization by selecting the 'Choose Visualization' option.

At this point, you'll have several options for different Grafana visualizations. We'll choose the first option, the 'Graph' visualization.

Grafana visualizations to choose from

There are multiple ways to configure our panel, but we'll accept all the defaults and create a simple 'Lines' graph.

In the far left section of the Grafana user interface, select the 'Queries' tab.

How to create a new Grafana query

Instead of using the Grafana query builder, we'll edit our query directly. In the view, click on the 'Edit SQL' button at the bottom.

Customize SQL queries in Grafana for data visualization

Before crafting our query, we must set up the Query database to leverage the New York City taxi cab datasource we connected to earlier:

Selecting data sources in Grafana

Unveiling metrics stored in TimescaleDB

Let's create a visualization that answers the question What is the daily ride frequency? as outlined in the Hello, Timescale! tutorial.

As demonstrated in the tutorial, the standard SQL syntax for our query is as follows:

SELECT date_trunc('day', pickup_datetime) AS day,
  COUNT(*)
FROM rides
GROUP BY day
ORDER BY day;

We will need to alter this query to support Grafana’s unique query syntax.

Modifying the SELECT statement

First, we will modify the date_trunc function to use the TimescaleDB time_bucket function. You can consult the TimescaleDB API Reference on time_bucket for more information on how to use it properly.

Let’s examine the SELECT portion of this query. First, we will bucket our results into one day groupings using the time_bucket function. If you set the ‘Format’ of a Grafana panel to be ‘Time series’, for use in Graph panel for example, then the query must return a column named time that returns either a SQL datetime or any numeric datatype representing a Unix epoch.

So, part 1 of this new query is modified so that the output of the time_bucket grouping is labeled time as Grafana requires, while part 2 is unchanged:

SELECT
  --1--
  time_bucket('1 day', pickup_datetime) AS "time",
  --2--
  COUNT(*)
FROM rides

The Grafana __timeFilter function

Grafana time-series panels include a tool that enables the end-user to filter on a given time range. A “time filter,” if you will. Not surprisingly, Grafana has a way to link the user interface construct in a Grafana panel with the query itself. In this case, the $__timefilter() function.

In the modified query below, we will use the $__timefilter() function to set the pickup_datetime column as the filtering range for our visualizations.

SELECT
  --1--
  time_bucket('1 day', pickup_datetime) AS "time",
  --2--
  COUNT(*)
FROM rides
WHERE $__timeFilter(pickup_datetime)

Defining Query Parameters

Our ultimate goal is to categorize our visualization based on the chosen time intervals and arrange the results in a sequential manner. Consequently, our GROUP BY and ORDER BY statements will rely on the time parameter.

With these adjustments, our final Grafana query takes shape:

SELECT
  --1--
  time_bucket('1 day', pickup_datetime) AS time,
  --2--
  COUNT(*)
FROM rides
WHERE $__timeFilter(pickup_datetime)
GROUP BY time
ORDER BY time

Upon visualizing this query in Grafana, we obtain the following outcome:

Illustrating Time-Series Data in Grafana

IMPORTANT REMINDER: Don't overlook setting the time filter in the upper right corner of your Grafana dashboard. If you're utilizing the pre-built sample dataset for this example, you'll want to set your time filter around January 1st, 2016.

Currently, the data is aggregated into 1-day groupings. Let's modify the time_bucket function to aggregate into 5-minute groupings instead and compare the resulting graphs:

SELECT
  --1--
  time_bucket('5m', pickup_datetime) AS time,
  --2--
  COUNT(*)
FROM rides
WHERE $__timeFilter(pickup_datetime)
GROUP BY time
ORDER BY time

When we bring this query to life, it will take the following form:

Visualizing time-series data in Grafana

Unleash the Potential of Geospatial Insights in TimescaleDB

The NYC Taxi Cab dataset also includes the pickup location for each ride. In the Hello, Timescale! Tutorial, we explored rides originating near Times Square. Now, let’s take it a step further and chart rides with a distance traveled exceeding five miles within Manhattan.

We can achieve this in Grafana using the ‘Worldmap Panel’. To begin, create a new panel, select ‘New Visualization’, and choose the ‘Worldmap Panel’.

Once again, we’ll modify our query directly. In the Query screen, ensure you’ve selected your NYC Taxicab Data as the data source. In the ‘Format as’ dropdown, select ‘Table’. Click on ‘Edit SQL’ and enter the following query in the text window:

SELECT time_bucket('5m', rides.pickup_datetime) AS time,
       rides.trip_distance AS value,
       rides.pickup_latitude AS latitude,
       rides.pickup_longitude AS longitude
FROM rides
WHERE $__timeFilter(rides.pickup_datetime) AND
  ST_Distance(pickup_geom,
              ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)
  ) < 2000
GROUP BY time,
         rides.trip_distance,
         rides.pickup_latitude,
         rides.pickup_longitude
ORDER BY time
LIMIT 500;

Let's dissect this query. Our initial goal is to create a visual representation of rides, with markers that indicate the distance traveled. Trips covering longer distances will receive distinct visual treatments on our map, with the trip_distance serving as the value for our plot, stored in the value field.

In the second and third lines of the SELECT statement, we're harnessing the pickup_longitude and pickup_latitude fields in the database, mapping them to variables longitude and latitude, respectively, to pinpoint exact locations.

In the WHERE clause, we're applying a geospatial boundary to identify trips within a 2000m radius of Times Square, effectively narrowing down our search.

Finally, in the GROUP BY clause, we're supplying the trip_distance and location variables, enabling Grafana to plot data with precision.

WARNING: This query may take some time, depending on your Internet connection speed. This is why we’re using the LIMIT statement for demonstration purposes, to avoid overwhelming the system.

Now, let’s configure our Worldmap visualization. Select the ‘Visualization’ tab, located at the far left of the Grafana user interface. You’ll find options for ‘Map Visual Options’, ‘Map Data Options’, and more, allowing for a high degree of customization.

First, ensure the ‘Map Data Options’ are set to ‘table’ and ‘current’. Then, in the ‘Field Mappings’ section, set the ‘Table Query Format’ to ‘Table’. We can map the ‘Latitude Field’ to our latitude variable, the ‘Longitude Field’ to our longitude variable, and the ‘Metric’ field to our value variable, creating a seamless connection between our data and visualization.

In the ‘Map Visual Options’, set the ‘Min Circle Size’ to 1 and the ‘Max Circle Size’ to 5, allowing for a range of visual representations.

In the ‘Threshold Options’, set the ‘Thresholds’ to ‘2,5,10’. This will automatically configure a set of colors, with any plot whose value is below 2 assigned one color, any value between 2 and 5 assigned another color, any value between 5 and 10 assigned a third color, and any value over 10 assigned a fourth color, providing a clear visual distinction.

Your configuration should resemble this:

Configuring Worldmap fields to query results in Grafana

At this point, data should be flowing into our Worldmap visualization, as shown, providing a comprehensive and interactive view of our data.

Exploring Time Series Data in PostgreSQL with Grafana's Interactive World Map

You can fine-tune the time filter at the top of your visualization to access trip pickup data across diverse time intervals.

Unlocking the Potential of Grafana Variables

Our goal is to create a variable that determines the type of ride displayed in the visual, based on the payment method used for the ride.

A wide range of payment options are available, which can be seen in the payment_types table:

 payment_type | description
--------------+-------------
            1 | credit card
            2 | cash
            3 | no charge
            4 | dispute
            5 | unknown
            6 | voided trip
(6 rows)

Grafana includes many types of variables, and variables in Grafana function just like variables in programming languages. We define a variable, and then we can reference it in our queries.

Define a new Grafana variable

To create a new variable, go to your Grafana dashboard settings, navigate to the ‘Variable’ option in the side-menu, and then click the ‘Add variable’ button.

In this case, we use the ‘Query’ type, where our variable will be defined as the results of SQL query.

Under the ‘General’ section, we’ll name our variable payment_type and give it a type of Query. Then, we’ll assign it the label of “Payment Type”, which is how it will appear in a drop-down menu.

We will select our data source and supply the query:

SELECT payment_type FROM payment_types;

Enable the 'Multi-value' and 'Include All option' functionalities. This will grant users of your dashboard the ability to choose multiple payment methods. The resulting configuration should resemble the following:

Utilizing a variable to filter results in a Grafana visualization

Click 'Add' to save your variable.

Integrating the Variable into a Grafana Panel

Let's modify the WorldMap panel we created in the previous section. The first thing you'll notice is that now that we've defined a variable for this dashboard, a dropdown menu for that variable has appeared in the upper left-hand corner of the panel.

We can utilize this variable to filter the results of our query using the WHERE clause in SQL. We will verify whether rides.payment_type is included in the array of the variable, which we've denoted as $payment_type.

Let's revise our earlier query as follows:

SELECT time_bucket('5m', rides.pickup_datetime) AS time,
       rides.trip_distance AS value,
       rides.pickup_latitude AS latitude,
       rides.pickup_longitude AS longitude
FROM rides
WHERE $__timeFilter(rides.pickup_datetime) AND
  ST_Distance(pickup_geom,
              ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)
  ) < 2000 AND
  rides.payment_type IN ($payment_type)
GROUP BY time,
         rides.trip_distance,
         rides.pickup_latitude,
         rides.pickup_longitude
ORDER BY time
LIMIT 500;

Use the variable in a Grafana panel

We can now leverage the drop-down menu to classify our rides based on the payment method utilized:

Visual representation of time series data in PostgreSQL using the Grafana Worldmap and filtering via a variable

Enhancing the Grafana filter for better readability

However, this filter lacks visual appeal. The numerical values, such as ‘1’, are ambiguous. Fortunately, when we set up our NYC Taxi Cab dataset, we created a payment_types table (which we queried earlier). The payment_types.description field contains a more descriptive explanation of each payment code, for instance, ‘credit card’, ‘cash’, and so forth. It is these descriptive labels that we want to feature in our drop-down menu.

Click ‘Dashboard settings’ (the “gear” icon in the upper-right of your Grafana visualizations). Select the ‘Variables’ tab on the left, and click the $payment_types variable. Modify your query to retrieve the description and store it in the __text field, and retrieve the payment_type and store it in the __value field, as follows:

SELECT description AS "__text", payment_type AS "__value" FROM payment_types

Your configuration should now mirror the following setup:

Customize our Grafana variable for enhanced readability

No modifications are required for the WorldMap visualization query itself. The database column designated as __text is leveraged whenever the variable is displayed, whereas the value assigned to __value serves as the actual value when Grafana executes a query.

As depicted, a variable can be utilized in a query in a manner akin to using a variable in any programming language.

. . . . . . . . .
Terabox Video Player