First Steps
Web scraping is the process of extracting data from a web page's source code, rather than through some API exposed by the owner(s) of that page. It can be a bit tricky at first, but it allows you to easily pull and organise lots of information from the web, without having to manually copy and paste anything.
To do some basic web scraping today, I'll use the Python library BeautifulSoup. If you haven't used this package before, you'll need to install it. The easiest way to do that is with the Python package manager pip. First, check if you have pip on your machine by trying to install a library with it:
$ pip install beautifulsoup4
If you have Python but don't have pip (i.e. if the above throws an error), install pip by itself using the instructions found here. macOS and most Linux distributions come with Python by default, but if you're on Windows and need to install Python, try the official website.
Python 2.7 reached end of life on 1 January 2020, so it might be better to just get Python 3 (if you don't yet have it). I don't have Python 3 yet (because I factory reset my Mac not too long ago), so I'm installing it first using these instructions, which essentially just boil down to:
$ brew install python
Now, we can check that both Python 2 and Python 3 are installed, and that pip was installed alongside Python 3:
$ python --version
Python 2.7.10
$ python3 --version
Python 3.7.2
$ pip --version
-bash: pip: command not found
$ pip3 --version
pip 19.0.2 from /usr/local/lib/python3.7/site-packages/pip (python 3.7)
Finally, let's get BeautifulSoup using pip3:
$ pip3 install beautifulsoup4
Note that, at this point, you could use the "normal" Python interpreter with the python3 command, or you could use the more feature-rich IPython by installing it:
$ pip3 install ipython
Throughout this tutorial, I'll be using IPython.
Preliminary Research
My motivation for this project was that I wanted to create an "average profile" of a developer at a given level in a given area, based on job postings on Indeed and similar websites. Doing that properly is a bit involved and might require some regular expressions, but a good place to start is to simply see how often a given technology is mentioned in job postings: more mentions == more important, right?
BeautifulSoup lets you access a page's XML / HTML tags by their type, id, class, and more. You can pull all <a> tags, for instance, or get the text of all <p> tags with a particular class.
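If you haven't seen this in action before, here's a minimal, self-contained sketch (the HTML snippet and class name are made up purely for illustration):

from bs4 import BeautifulSoup

html = """
<html><body>
  <p class="intro">Hello!</p>
  <a href="/first">First link</a>
  <a href="/second">Second link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find_all('a'))                    # every <a> tag in the snippet
print(soup.select('p.intro')[0].get_text())  # text of the <p> with class "intro": 'Hello!'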
So to pull data out in a regular way, we need to dissect the structure of the pages we want to scrape. Let's start by doing a search for JavaScript developers in New York City:
Note the URL of this web page:
https://www.indeed.com/jobs?q=javascript+developer&l=New+York+City
If we go to the second page of results, it changes to:
https://www.indeed.com/jobs?q=javascript+developer&l=New+York+City&start=10
...and the third page of results:
https://www.indeed.com/jobs?q=javascript+developer&l=New+York+City&start=20
Right, so there are 10 results per page, and each page after the first has an additional parameter in the URL: &start=..., where ... is a positive multiple of 10. (As it turns out, we can append &start=0 to the URL of the first page and it returns the same results.) Okay, so we know how to access pages of results... what's next? How about we inspect the structure of the first results page:
One thing I notice is that the links to each job ad seem to have an onmousedown attribute which changes predictably. The first one is
onmousedown="return rclk(this,jobmap[0],0);"
...the second is
onmousedown="return rclk(this,jobmap[1],0);"
...and so on. I would bet that we can pull all <a> tags with an onmousedown containing "return rclk(this,jobmap[", and that would give us the links to all the jobs listed on this page. Let's put that in our back pocket for now and open one of these ads -- let's see if we can figure out where the job specifications are within these pages:
It looks like the main body of the ad is contained in a <div> with class="jobsearch-JobComponent-description". That sounds like a pretty specific div. I'll just go ahead and assume it's the same on every page, but you can check if you like. So now that we know the structure of the URLs we want to visit, how to find links to job ads on those pages, and where the text of the ad is contained in those subpages, we can build a web scraping script!
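Before we write any code, it might help to sketch the plan as an outline (comments only, mirroring the three observations above):

# for each results page (start=0, 10, 20, ...):
#     fetch the page and collect every <a> tag whose onmousedown contains "rclk(this,jobmap["
#     for each collected link:
#         fetch the job ad page it points to
#         find the <div> whose class contains "jobsearch-JobComponent-description"
#         extract its text for processing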
Building the Scraper
Let's start by just looping over search pages. Our URL will look something like:
https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=
...but we need to append a non-negative multiple of 10 to the end. An easy way to do this in Python is to create a range loop:
In [91]: for pageno in range(0,10):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno)
...: print(search)
...:
https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=0
https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=10
https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=20
...
https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=90
That looks good! Note that we had to convert the integer to a string with Python's built-in str() function.
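As an aside, if you're on Python 3.6 or later, an f-string does the conversion and concatenation in one step. This is just an equivalent alternative, not what I'll use below:

search = f"https://www.indeed.com/jobs?q=javascript&l=New+York+City&start={10 * pageno}"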
What we really want to do is actually visit these pages and extract their content. We can do that with Python's urllib module -- specifically urllib.request.urlopen() (Python 3 only). We can then parse the page by handing it to the BeautifulSoup constructor, which we import from the bs4 package. To test this, let's temporarily reduce our loop range to just one page and print the contents of the page with soup.prettify():
In [98]: import urllib.request

In [99]: from bs4 import BeautifulSoup

In [100]: for pageno in range(0,1):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno)
...: url = urllib.request.urlopen(search)
...: soup = BeautifulSoup(url)
...: print(soup.prettify()[:500])
...:
<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<script src="/s/a3599cf/en_US.js" type="text/javascript">
</script>
<link href="/s/97464e7/jobsearch_all.css" rel="stylesheet" type="text/css"/>
<link href="http://rss.indeed.com/rss?q=javascript&l=New+York+City" rel="alternate" title="Javascript Jobs, Employment in New York, NY" type="application/rss+xml"/>
<link href="/m/jobs?q=javascript&l=New+York+City" m
I trimmed the output using string slicing, limiting it to 500 characters (the source code of this page is pretty long). Even in that short snippet, though, you can spot our original search: q=javascript&l=New+York+City. Great -- this seems to work!
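One caveat: calling the BeautifulSoup constructor without naming a parser works, but recent versions of the library emit a warning asking you to specify one explicitly. If that bothers you, pass the parser name yourself, something like:

soup = BeautifulSoup(url, "html.parser")  # or "lxml" / "html5lib", if you have them installed

I'll keep the shorter form in the snippets below for brevity.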
Let's use select() now to grab all of the job ad links on this page. Remember that we're looking for all of the <a> tags with an onmousedown containing "return rclk(this,jobmap[". We can do that with a CSS attribute selector: a[onmousedown*="..."] matches <a> tags whose onmousedown attribute contains the given substring. See below:
In [102]: for pageno in range(0,1):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno)
...: url = urllib.request.urlopen(search)
...: soup = BeautifulSoup(url)
...:
...: for adlink in soup.select('a[onmousedown*="return rclk(this,jobmap["]'):
...: subURL = "https://www.indeed.com" + adlink['href']
...: print(subURL)
...:
https://www.indeed.com/rc/clk?jk=43837af9ab727a8b&fccid=927356efef1f3075&vjs=3
https://www.indeed.com/rc/clk?jk=6511fae8b53360f1&fccid=f057e04c37cca134&vjs=3
https://www.indeed.com/company/Transport-Learning/jobs/React-HTML-Javascript-Developer-ca898e4825aa3f36?fccid=6b6d25caa00a7d0a&vjs=3
...
https://www.indeed.com/rc/clk?jk=9a3a9b4a4cbb3f28&fccid=101a2d7616184cc8&vjs=3
We append "https://www.indeed.com" to the beginning of each link because, in the source code of the page, all the hrefs are relative. If we grab one of these links (say the third one) and paste it into the browser, we should hopefully get a job ad:
...looking good! Okay, what's next? Well, we want to open these subpages with BeautifulSoup again and parse the source code. But this time, we want to look for <div>s with a class that contains jobsearch-JobComponent-description. So let's use string slicing again and print the first, say, 50 characters of each page, just to make sure that all of these URLs are working:
In [103]: for pageno in range(0,1):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno)
...: url = urllib.request.urlopen(search)
...: soup = BeautifulSoup(url)
...:
...: for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'):
...: subURL = "https://www.indeed.com" + adlink['href']
...: subSOUP = BeautifulSoup(urllib.request.urlopen(subURL))
...: print(subSOUP.prettify()[:50])
...:
<html dir="ltr" lang="en">
<head>
<title>
Ne
<html dir="ltr" lang="en">
<head>
<title>
Re
<html dir="ltr" lang="en">
<head>
<title>
Re
...
<html dir="ltr" lang="en">
<head>
<title>
Ni
Again, great! Everything's working so far. The next thing to do is to extract the text of the main body of each ad. Let's use the same *= syntax in select() that we used previously to find <div>s in these subpages whose class attribute contains jobsearch-JobComponent-description:
In [106]: for pageno in range(0,1):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno)
...: url = urllib.request.urlopen(search)
...: soup = BeautifulSoup(url)
...:
...: for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'):
...: subURL = "https://www.indeed.com" + adlink['href']
...: subSOUP = BeautifulSoup(urllib.request.urlopen(subURL))
...:
...: for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'):
...: print(desc.get_text()[:50])
...:
Impact
Ever wondered how Amazon offers the Earth'
Mobile & Web Engineering is looking for talented w
Job Description
We are looking for a talented Fro
$75,000 - $95,000 a yearYour first few months:We c
Michael Kors is always interested in hearing from
Facebook's mission is to give people the power to
$70,000 - $80,000 a yearWe Make Websites are the g
InternshipApplications are due by June 27, 2019 at
Job Overview:
UI Developer should have a very goo
* THIS IS A REMOTE POSITION *
At Dental Intellige
BeautifulSoup.select() returns the HTML / XML tags which match the search parameters we provide. We can pull attributes from those tags with bracket notation (as in adlink['href']), and we can pull the text contained between opening and closing tags (for instance, between <p> and </p>) with get_text(), as we did above. The subSOUP.select() statement returns a list of <div> tags whose class attributes contain the substring "jobsearch-JobComponent-description"; we then use a for ... in loop to visit each <div> in that list (there's only one per page) and print the text contained within <div> ... </div> with get_text().
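To see those three moves (select(), bracket notation, and get_text()) in isolation, here's a self-contained toy example -- the tag, its attributes, and its text are made up for illustration:

from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<a href="/jobs/1" onmousedown="return rclk(this,jobmap[0],0);">JS Developer</a>',
    "html.parser")

link = demo.select('a[onmousedown*="rclk(this,jobmap["]')[0]
print(link['href'])     # bracket notation pulls an attribute: /jobs/1
print(link.get_text())  # the text between the tags: JS Developer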
Back in the scraper, the result is the list of jumbled text above. It doesn't make much sense because we cut each description off after only 50 characters, but we now have a fully functional Indeed job ad scraper! We just need to figure out what to do with these results to complete our task.
Organizing Your Web Scrapings
The easiest thing to do is to come up with a list of keywords we're interested in. Let's look at the popularity of various JavaScript frameworks. How about:
frameworks = ['angular', 'react', 'vue', 'ember', 'meteor', 'mithril', 'node', 'polymer', 'aurelia', 'backbone']
...that's probably a good start. If you're familiar with processing text data like this, you'll know that we have to convert everything to lowercase to avoid ambiguity between things like "React" and "react", that we have to remove punctuation so we don't count "Angular" and "Angular," as two separate things, and that we can easily split the text into tokens on spaces using split().
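To see the problem concretely, here's a toy example with made-up ad text:

text = "We use React. Experience with react, Angular, and Node.js required."
print(text.lower().split())
# ['we', 'use', 'react.', 'experience', 'with', 'react,', 'angular,', 'and', 'node.js', 'required.']

Note the tokens 'react.' and 'react,': without more cleanup, they'd be counted separately from 'react'.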
Let's first split the text of each ad, convert each word to lowercase, and see what our list of words looks like:
In [110]: for pageno in range(0,1):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno)
...: url = urllib.request.urlopen(search)
...: soup = BeautifulSoup(url)
...:
...: for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'):
...: subURL = "https://www.indeed.com" + adlink['href']
...: subSOUP = BeautifulSoup(urllib.request.urlopen(subURL))
...:
...: for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'):
...: words = desc.get_text().lower().split()[:50]
...: for word in words:
...: print(word)
...:
mobile
&
web
engineering
is
looking
for
talented
web
developers
to
join
the
digital
acquisitions
engineering
group.
...
...and so on. Let's pick out some weird ones:
group.
role,
summary:
recoded:you'd
limitless.we
react.within
...right, so we'll have to split on spaces as well as on '.', ',', and ':'. Elsewhere in the list, we have:
2.0-enabled
which will, of course, be damaged by splitting on '.', but I think the benefits outweigh the costs here. We also have lots of hyphenated words, like:
blue-chip
data-driven,
hyper-personalized,
go-to
team-based
e-commerce
...so we probably shouldn't split on hyphens or dashes. We do, however, have one or two like:
trends/development
qa/qc
...so we'll want to split on '/' as well. Finally, there's nothing we can do about words that have been run together (most likely whitespace lost between HTML elements when we extracted the text), like:
analystabout
part-timeat
contractlocation:
yearyour
...at the moment, so we'll have to leave those as-is. To make this solution a bit more robust, we want to split on multiple separators, not just the space character. For that, we need Python's regular expression library, re.
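As a quick sanity check, re.split() accepts a regular expression; a character class like [ ,.:/] means "any one of these characters", and the string is split at every match. A toy example with made-up input:

import re

print(re.split("[ ,.:/]", "trends/development, qa/qc"))
# ['trends', 'development', '', 'qa', 'qc']

Note the empty string in the result: adjacent separators produce empty tokens, which is part of why we'll filter out very short "words" later. Now let's plug re.split() into the scraper: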
In [110]: import re
In [111]: for pageno in range(0,1):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno)
...: url = urllib.request.urlopen(search)
...: soup = BeautifulSoup(url)
...:
...: for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'):
...: subURL = "https://www.indeed.com" + adlink['href']
...: subSOUP = BeautifulSoup(urllib.request.urlopen(subURL))
...:
...: for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'):
...: words = re.split("[ ,.:/]", desc.get_text().lower())[:50]
...: for word in words:
...: print(word)
...:
impact
ever
wondered
how
amazon
offers
the
earth's
biggest
selection
and
still
...
Right. So now what weirdos do we have?
earth's
customers?
$75
000
-
$95
000
(both
ios
and
android)
facebook's
$70
000
-
$80
000
11
59pm
*
So, still a few edge cases. Easy-to-fix ones include removing a trailing 's from words and adding '?', '(', and ')' to the list of separator characters (as well as whitespace characters like \n, \t, and \r). (One more quick scan reveals that we should add '!' to the list of separator characters as well, obviously.) We can also ignore words that are only a single character long or shorter, which conveniently also discards the empty strings produced by the split. Fixing the problems with times (11:59pm) and salaries ($70,000 - $80,000) is a bit more involved and won't be covered here; for now, we'll just ignore those. So let's check out our improved scraper:
In [121]: for pageno in range(0,1):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno)
...: url = urllib.request.urlopen(search)
...: soup = BeautifulSoup(url)
...:
...: for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'):
...: subURL = "https://www.indeed.com" + adlink['href']
...: subSOUP = BeautifulSoup(urllib.request.urlopen(subURL))
...:
...: for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'):
...: words = re.split("[ ,.:/?!()\n\t\r]", desc.get_text().lower())[:50]
...: for word in words:
...: word = word.strip()
...: if word.endswith("'s"):
...: word = word[:-2]
...: if len(word) < 2:
...: continue
...: print(word)
...:
Beautiful! Now, what can we do with it?
Insights
Instead of simply printing a list of words, let's add them to a dictionary. Every time we encounter a new word, we can add it to our dictionary with an initial value of 1, and every time we encounter a word we've seen before, we can increment its counter:
In [123]: counts = {}
...:
...: for pageno in range(0,1):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(10*pageno)
...: url = urllib.request.urlopen(search)
...: soup = BeautifulSoup(url)
...:
...: for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'):
...: subURL = "https://www.indeed.com" + adlink['href']
...: subSOUP = BeautifulSoup(urllib.request.urlopen(subURL))
...: print("Scraping: " + subURL + "...")
...:
...: for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'):
...: words = re.split("[ ,.:/?!()\n\t\r]", desc.get_text().lower())[:50]
...: for word in words:
...: word = word.strip()
...: if word.endswith("'s"):
...: word = word[:-2]
...: if len(word) < 2:
...: continue
...: if word in counts:
...: counts[word] += 1
...: else:
...: counts[word] = 1
...:
...: print(counts)
...:
Scraping: https://www.indeed.com/company/CypressG/jobs/Newer-Javascript-Framework-Developer-5a17b0475e76de26?fccid=dc16349e968c035d&vjs=3...
Scraping: https://www.indeed.com/company/Transport-Learning/jobs/React-HTML-Javascript-Developer-ca898e4825aa3f36?fccid=6b6d25caa00a7d0a&vjs=3...
Scraping: https://www.indeed.com/rc/clk?jk=a0727d28799f1dff&fccid=5d5fde8e5925b19a&vjs=3...
...
Scraping: https://www.indeed.com/rc/clk?jk=b084048e6a1b2727&fccid=5d5fde8e5925b19a&vjs=3...
{'$80': 1, '000': 8, '$250': 1, 'yeari': 1,...
I added a "Scraping" echo to the user so we can be sure our script is progressing. Note that the resulting dictionary is not sorted! If we want to order it by value, there are a few different ways to do that, but the easiest is probably to turn it into a list of tuples, flipping the keys and values so that the tuples sort naturally by their first element (the number of occurrences of a particular word):
word_freq = []
for key, value in counts.items():
word_freq.append((value,key))
word_freq.sort(reverse=True)
We sort with reverse=True so the list runs high-to-low, and the most common words end up at the top. Let's see the result:
[(19, 'to'), (13, 'and'), (12, 'the'), (11, 'for'), (9, 'of'), (9, 'is'), (6, 'we'), (6, 'in'), (6, '000'), (5, 'you')]
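As an aside, Python's standard library can do this bookkeeping for us: collections.Counter is a dictionary specialised for counting, and its most_common() method returns (word, count) pairs already sorted high-to-low. A sketch of the equivalent approach (not what the script above uses):

from collections import Counter

counts = Counter()
for word in ["react", "node", "react", "angular", "react", "node"]:  # stand-in for the scraped words
    counts[word] += 1  # missing keys default to 0, so no membership check is needed

print(counts.most_common())  # [('react', 3), ('node', 2), ('angular', 1)]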
Of course, the reason we want to pick out specific words (like "angular", "react", etc.) is that otherwise we get a bunch of useless filler words (like "to", "and", etc.). So let's define a list of "good" words, check each word against that list, and only count the ones we care about. Finally, I'll also get rid of the [:50] slice which we used for debugging, and expand my search to the first 100 pages of results. Here is the final script:
In [127]: counts = {}
...: frameworks = ['angular', 'react', 'vue', 'ember', 'meteor', 'mithril', 'node', 'polymer', 'aurelia', 'backbone']
...: max_pages = 100
...: ads_per_page = 10
...: max_ads = max_pages * ads_per_page
...:
...: for pageno in range(0, max_pages):
...: search = "https://www.indeed.com/jobs?q=javascript&l=New+York+City&start=" + str(ads_per_page * pageno)
...: url = urllib.request.urlopen(search)
...: soup = BeautifulSoup(url)
...: this_page_ad_counter = 0
...:
...: for adlink in soup.select('a[onmousedown*="rclk(this,jobmap"]'):
...: href = adlink['href']
...: subURL = "https://www.indeed.com" + href
...: subSOUP = BeautifulSoup(urllib.request.urlopen(subURL))
...: ad_index = this_page_ad_counter + pageno*ads_per_page
...: print("Scraping (" + str(ad_index + 1) + "/" + str(max_ads) + "): " + href + "...")
...: this_page_ad_counter += 1
...:
...: for desc in subSOUP.select('div[class*="jobsearch-JobComponent-description"]'):
...: words = re.split("[ ,.:/?!()\n\t\r]", desc.get_text().lower())
...: for word in words:
...: word = word.strip()
...: if word.endswith("'s"):
...: word = word[:-2]
...: if word.endswith(".js"):
...: word = word[:-3]
...: if word.endswith("js"):
...: word = word[:-2]
...: if len(word) < 2:
...: continue
...: if word not in frameworks:
...: continue
...: if word in counts:
...: counts[word] += 1
...: else:
...: counts[word] = 1
...:
...: word_freq = []
...:
...: for key, value in counts.items():
...: word_freq.append((value,key))
...:
...: word_freq.sort(reverse=True)
...:
...: print(word_freq)
...:
Scraping (1/1000): /rc/clk?jk=72b4ac2da9ecb39d&fccid=f057e04c37cca134&vjs=3...
Scraping (2/1000): /company/Transport-Learning/jobs/React-HTML-Javascript-Developer-ca898e4825aa3f36?fccid=6b6d25caa00a7d0a&vjs=3...
Scraping (3/1000): /rc/clk?jk=9a3a9b4a4cbb3f28&fccid=101a2d7616184cc8&vjs=3...
...
I made some small aesthetic changes... can you see where they are? I also made sure to remove ".js" or "js" from the end of any framework names, so they're not counted as separate things. I removed the "magic number" 10 from the script and put it in a descriptive variable (ads_per_page), and I created a variable (max_pages) which says I should only look at 100 pages of results. In total, then, I'll look at the 1000 most recent "Javascript" ads posted on Indeed in the NYC area.
This is going to take a while, so I'll go grab some coffee and come back...
...so, what does the result look like?
[(556, 'react'), (313, 'angular'), (272, 'node'), (105, 'vue'), (45, 'backbone'), (36, 'ember'), (4, 'polymer')]
So, out of 1000 ads scraped, 556 mentioned "react", 313 mentioned "angular", and so on. Quite a bit of insight from a quick script!
Applications
With some more work, this could be turned into a website / app where developers (or anyone) looking for a job could find out what the average requirements are ("...56% of ads requested experience with React..."), what the average salary is ("...$55,000 +/- $2,000..."), and benchmark themselves against those averages. Such a tool would be really useful in salary negotiations, or when trying to decide what new technologies / languages to learn to advance your career. Data could be kept current by tracking ad posting dates and throwing out stale information (older than, say, a week).
This information would also be useful to employers, giving them a better idea of where to set salaries for certain positions, levels of experience, and so on. Indeed was just the first step, but this scraping could easily be expanded to multiple job posting websites.
This prototype only took a few hours' work for one person with limited Python experience. I would imagine that a small team of people could get this app up and running in just a few weeks. Thoughts? Does anyone know of anything similar?