RockOn pt. 1: Using Nokogiri to Seed my Database

dianakw8591 - Jun 26 '20 - Dev Community

As my final project for Flatiron School's software development program I chose a topic near and dear to my heart: rock climbing. My previous job had been as a climbing ranger in Yosemite, and I've invested years in the sport and traveled all over the world for it. I was actually in Spain climbing (where this cover photo was taken) while I was applying to software engineering programs. On rest days I'd practice programming fundamentals and work through pre-coursework for bootcamps. I would also take careful notes about all the climbing I was doing, something I've done for years. It's not uncommon for climbers to keep a logbook or journal, whether in paper or spreadsheet format, and it's a nice way to mark accomplishments, keep notes and track progress. Here's my entry from the cover photo day, the last climb of the trip on Pa Ella y Pa los Guiris:

Log entry with climbs, grades and dates
Not super legible, is it?

But here's the problem with my current logbook: I can hardly read my handwriting! Never mind if I'm curious about how many days I've climbed in a year, or how many routes I've climbed of a certain type or grade. That's a lot of time spent going through and adding up all of my entries by hand!

Thus the concept for my final project: a digital climbing logbook that allows climbers to easily enter new climbs and visualize their climbing history. I'll talk about the data visualization in a later post, but in this post I'll discuss the first major hurdle I encountered: how to give a user easy access to information about a specific rock climb.

For any given ascent, there are some 'subjective' features and some 'objective' features. Subjective features include information like whether you fell, along with any notes about your experience on the climb. Objective features are things like grade (how hard the route is) and length. That's the type of information that can be found in a guidebook, and while there can be intense debate in the community about some of these objective features, particularly grade, it's information that is inherent to a given climb. Therefore, I did not want my users to have to manually enter names, grades, and lengths for every ascent. Rather, they should be able to select that information from a database and only enter information specific to their experience on the climb.

Thankfully, this information is freely available on the internet in the form of Mountain Project. For North American rock climbing particularly, Mountain Project's database is quite thorough, and they offer an API to access route and area information. However, their API only has endpoints for route IDs or latitude/longitude coordinates, and I wanted users to be able to search for routes by name. I couldn't use the route ID endpoint from my front end because a user has no idea what a given climb's ID is! I considered using a map where the user could select a point near their climbing location, and using that coordinate I could return a list of climbs from Mountain Project within a certain radius. That might have worked well in some scenarios, but in areas with thousands of climbs, a coordinate is not specific enough for an easy search experience through the returned list. Plus, locating a climbing area on a map is just not as easy as typing in a route name! Ultimately, I decided that in order to provide the best user experience, I needed to seed my own database and allow a user to search it by route name or area name.

My end goal was to make the entirety of Mountain Project searchable in my own database, which meant first getting every route ID. Enter Nokogiri, an HTML and XML parser (it offers SAX and Reader interfaces as well). If I could figure out where the route IDs existed in the HTML structure of Mountain Project's pages, I could open each page with Open-URI (a module that ships with Ruby) and use Nokogiri's CSS selectors to access the IDs.

Nokogiri is a dependency in every Rails project, so it's extremely easy to get started. I simply required Open-URI at the top of my file, and then used Open-URI and Nokogiri together like this:

require 'open-uri'
...
doc = Nokogiri::HTML(URI.open(url))

where url is the URL of the page I wanted to parse. Now that I had access to the HTML, I just needed to use the right selectors to get the information I was interested in - namely area names and IDs, and route IDs.

Mountain Project's information is structured hierarchically with areas containing sub-areas or routes but never both. Therefore, my general strategy was to start at a high-level area page. If that page contained sub-areas, I would recursively go through each of those. If an area contained routes, I would grab the route IDs.
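
Two simple models back this strategy on my end. Here's a minimal sketch - assuming a self-referential Area model and the column names that appear in the scraping code below; the real models may have more going on:

class Area < ApplicationRecord
  # columns (assumed): name, mtnproj_id, and a self-referential parent_id
  belongs_to :parent, class_name: 'Area', optional: true
  has_many :subareas, class_name: 'Area', foreign_key: :parent_id
  has_many :climbs
end

class Climb < ApplicationRecord
  # columns (assumed): mtnproj_id plus a foreign key to the containing area
  belongs_to :area
end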

Let's walk through my code for an area page: Yosemite National Park. The name of the area is a big header at the top, and it's easy to find the <h1> I'm looking for in the HTML with Chrome's developer tools:
Mountain Project's Yosemite page with corresponding HTML
To select the inner text of that particular header I consulted Nokogiri's documentation for available methods and used a fair amount of trial and error. My final code looks like this:

name = doc.css('h1')[0].children[0].text.strip

where .css means I'm using CSS selectors, in this case selecting <h1> tags. That query returns a node set, similar to an array, from which I want the 0th item. From there I take the text of the first child node and strip any leading or trailing whitespace. With the name string successfully selected, I create a new Area entry in my own database.
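
To make that chain concrete, here's a minimal sketch against a made-up header in the same shape as Mountain Project's (the HTML here is illustrative, not copied from the site):

require 'nokogiri'

html = <<~HTML
  <h1>
    Yosemite National Park
    <a href="#">Improve this page</a>
  </h1>
HTML

doc = Nokogiri::HTML(html)
doc.css('h1')                 # node set of every <h1> on the page
doc.css('h1')[0]              # the first (and here only) <h1> element
doc.css('h1')[0].children[0]  # its first child node - the text node
doc.css('h1')[0].children[0].text.strip  # => "Yosemite National Park"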

At this point I need to check if the area contains sub-areas or routes. Sub-areas are identified by a class of left-nav-row as seen here:
Yosemite sub-areas HTML
To select that with CSS selectors, I selected all the <a> tags within divs of that class:

areas = doc.css('div.left-nav-row a')

If that node set is not empty, I access each link's href attribute and use regex matching to pull out the area IDs:

area_ids = areas.map { |u| u.attr('href').match(/[0-9]+/).to_s }
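
Each href has the form /area/<id>/<area-name>, so the first run of digits in the path is the area ID. A quick illustration with a sample path:

href = '/area/105833381/yosemite-national-park'
href.match(/[0-9]+/).to_s  # => "105833381"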

If the areas node set is empty, I look for route information instead. On an area page that contains routes, route IDs are found in a table with an ID of left-nav-route-table:
Yosemite sub-area with routes and corresponding HTML
and the code to access the routes:

routes = doc.css('table#left-nav-route-table a')

Accessing the route IDs is similar to accessing the area IDs:

route_ids = routes.map { |a| a.attr('href').match(/[0-9]+/).to_s }

and with route IDs in hand I create Climb entries and use the IDs later to hit Mountain Project's API and flesh out the route information.
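
For reference, that later step looked roughly like this - a sketch assuming the get-routes endpoint of Mountain Project's data API as it existed at the time, with MY_KEY standing in for a real API key and the response fields abbreviated:

require 'open-uri'
require 'json'

# the API accepted a comma-separated batch of route IDs
ids = route_ids.first(100).join(',')
url = "https://www.mountainproject.com/data/get-routes?routeIds=#{ids}&key=MY_KEY"
data = JSON.parse(URI.open(url).read)
data['routes'].each do |route|
  # update the matching Climb with the route's name, grade, length, etc.
end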

Here's the full code for my scraping method:

def getIds(areaId, parentId)
  base_url = 'https://www.mountainproject.com/area/'
  url = base_url + areaId
  doc = Nokogiri::HTML(URI.open(url))
  # get the area name and save a new area
  name = doc.css('h1')[0].children[0].text.strip
  puts('in area:', name)
  current_area = Area.create(name: name, parent_id: parentId, mtnproj_id: areaId)
  # check for subareas
  areas = doc.css('div.left-nav-row a')
  if areas.empty?
    # no subareas, so get the routes from this area
    routes = doc.css('table#left-nav-route-table a')
    route_ids = routes.map { |a| a.attr('href').match(/[0-9]+/).to_s }
    route_ids.each { |route_id| Climb.create(mtnproj_id: route_id, area_id: current_area.id) }
  else
    # otherwise recurse into each subarea
    area_ids = areas.map { |u| u.attr('href').match(/[0-9]+/).to_s }
    area_ids.each { |id| getIds(id, current_area.id) }
  end
end
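
Kicking off a scrape is then a single call with a top-level area's Mountain Project ID and no parent. The ID below is illustrative - you can read any area's ID out of its page URL:

getIds('105833381', nil)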

The puts statement is a visual indicator that my code is progressing through areas as expected, and lets me know which area to start from if my code gets hung up, which did happen a couple of times. I tested my method first on small areas with few subareas and routes, and after verifying that it worked as expected, ran it on larger parent areas, ultimately scraping all 222,195 route IDs from Mountain Project. After that it was simple to access Mountain Project's API for the rest of the route information I wanted.

Going forward I would like to create a background process that updates my route information on a weekly basis so that climbs added to Mountain Project are updated in my database as well. However, the scraping code took days to run through all of Mountain Project's pages, so I'd like to write a more efficient solution for updates.

Next week I'll walk through how I made my database searchable from the front end and the challenges I encountered in making a user-friendly entry form. For now you can check out RockOn or watch my demo video.

Thanks for reading!
