Web scraping is a handy skill to have if you want to create a dataset of your own. If you want to build a package that generates new datasets, this will be very helpful; if not, it is still a great way to practice writing your own functions.
Today we’ll practice scraping some information about the SDS faculty from the Smith website. The same principles can be applied to any simple website (one without interactive elements). Just remember that you must be very careful when constructing scraper bots: make sure you obey the terms of service for the site you are scraping, and that the bots you build are polite.
Failure to build polite bots can result in your (or the school’s) IP address being banned from a website forever.
Remember from our lecture that the legality of web scraping is a grey area. It depends on several factors, including:
In general, you should never scrape:
Rather than working with a dataset today, you will be making one. Our goal is to start with the Statistical & Data Sciences webpage, and end up with a dataframe containing the name, title, and URL for all the Smith SDS faculty.
Before we start writing any code, we need to make sure we are allowed to scrape the website. A good first check is the robots.txt of the website. For Smith’s site, that would be https://www.smith.edu/robots.txt. This page gives us a map of where we are and are not allowed to scrape. Any page listed with “Disallow” is off limits, as is anything under that directory. For example, if www.site.com/page1 is off limits, so is www.site.com/page1/sub_page. It seems the parent directory of the SDS page, www.smith.edu/academics, is in the clear.
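If you would rather check this from R, the third-party robotstxt package (not something the rest of this tutorial depends on, so treat this as an optional sketch) can parse a robots.txt file for you:

library(robotstxt)

# Ask whether a generic bot may fetch the SDS page
paths_allowed("https://www.smith.edu/academics/statistics")

# See whether the site requests a Crawl-delay between pages
robotstxt("www.smith.edu")$crawl_delay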
Once that quick check is passed, we have to do the harder work of reading and understanding the website Terms of Service (ToS); you can find Smith’s here. Yes, you actually need to read it. You are actually bound by these terms just by looking at the site, so it is worth the time.
The first thing you may notice, under section 4, is that we are authorized to save pages from the site onto one single hard drive. Conversely, that means we are not able to save multiple copies, or to save them anywhere that will be shared online. You may also notice that section 7.2 mentions scraping: we cannot use it to access parts of the site not made publicly available. What we want is publicly available, so we should be in the clear.
What other things do you find interesting regarding the Terms of Service for the Smith website?
Now that we know what we are and are not allowed to do, let’s go look at the SDS page. Our goal here is to figure out what we need to get from this page to move us closer to our goal. We know we want to look at each individual faculty box, so what could we do to get a link to all of them?
We could just go to each faculty member’s page, copy the data into an Excel file, and proceed from there. That would work for the 15 items we want here, but what if we wanted 150? What if we wanted to re-use our code for another department? Best to write a programmatic solution.
Most of the information we want is clearly visible, but the link to each faculty page is not. We know that clicking on each faculty portrait will take us to their page, so there is a link in there somewhere; we just need to get it out. We’ll use SelectorGadget to help with that. First, we’ll need to add it to our bookmarks bar. Right click on the bookmarks bar in your browser of choice, make a new bookmark called “SelectorGadget”, and set the URL to the following:
javascript:(function(){var%20s=document.createElement('div');s.innerHTML='Loading...';s.style.color='black';s.style.padding='20px';s.style.position='fixed';s.style.zIndex='9999';s.style.fontSize='3.0em';s.style.border='2px%20solid%20black';s.style.right='40px';s.style.top='40px';s.setAttribute('class','selector_gadget_loading');s.style.background='white';document.body.appendChild(s);s=document.createElement('script');s.setAttribute('type','text/javascript');s.setAttribute('src','https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js');document.body.appendChild(s);})();
That looks like gibberish to us, but the computer will get it.
Once that is done, save the bookmark, and get back to the SDS page. While you are on the page, click on the “SelectorGadget” bookmark in your bar. A grey and white bar will appear in the lower right of your browser, and orange boxes will start to appear wherever you have your mouse. Click on the thing you want to find, in this case the name of a faculty member; it doesn’t matter which one, just make sure the highlight box is only around the name. If you did it right, the name you clicked on will highlight green, and the rest will highlight yellow like the following.
We’re not done yet though. Look around and make sure we haven’t included anything extra.
Look over the page and make sure only the names are highlighted. If anything else is highlighted, click on it to turn it red and exclude it.
The Program Committee header needs to be excluded.
Look at the SelectorGadget grey box in the lower right and you will see a short string starting with .fac-inset; that’s the CSS selector for our names. Copy the whole string and keep it somewhere handy.
The first step of our scraping is to find the names for all the faculty. The dataframe we create will also be used to store the rest of our data later.
Start by loading rvest, and scraping the whole SDS page into R. This can be done with the read_html() function, much like reading a CSV. Save the webpage to an object called sds_home. It is important to note that this is the step that can get us in trouble. Once we have the page in R, we are working with it like anything else on our computer. But the process of reading the page from the internet can cause problems if we do so too fast. From the robots.txt page, we know Smith only wants us to pull one page every 10 seconds. If we do any more, they can ban us from the website. Be careful!
Scrape the SDS home page into R and store it in an object called sds_home.
library(rvest)
sds_home <- read_html("https://www.smith.edu/academics/statistics")
Once we have the whole page, we can start pulling information from it. The usual workflow here is to tell R what HTML structure we are interested in, and then what we want from it. For example, we can say we want the names of faculty using the selector path we found previously.
To do that, we need to say “from this page, look at this structure, and give me the contents.” In R, that corresponds to the html_elements() and html_text2() functions. Give our sds_home object to html_elements(), and as an argument, specify that css = the string we got from SelectorGadget earlier. Either pipe the result into, or wrap it in, the html_text2() function to get the actual names of the faculty.
There is also an html_element() function (singular, not plural). This will only give you the first match.
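For example (a quick illustrative sketch; the “a” selector here just matches every link element on the page):

# html_elements() returns all matching nodes; html_element() only the first
html_elements(sds_home, "a")
html_element(sds_home, "a")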
Create a dataframe called sds_faculty with a column called name for all the SDS faculty names.
sds_faculty <- data.frame(name = html_text2(html_elements(sds_home, ".fac-inset h3")))
Next we want to get the titles for all the faculty. The process is exactly the same as above, but we need to set a different target using SelectorGadget.
Using the same process as before, add the title of each faculty member to our sds_faculty dataframe in a new column called title.
sds_faculty$title <- html_text2(html_elements(sds_home, ".fac-inset p"))
This last one will be a bit different. Rather than wanting the actual text on the page, we want the link that the text is tied to; i.e., when we click on a faculty name, it follows a link to their individual page. Rather than using html_text2(), we will use the more general html_attr() function. This gives us more control to tell R that we want the link the text represents, not the text itself.

In HTML speak, the page a link points to is designated by the href, or Hypertext Reference. We need to tell R that this is what we want. We can do that by passing “href” to the name argument of html_attr(); “href” is the name of the attribute we want to get.
Using html_elements() and html_attr(), get the links from the faculty names and add them to our dataframe in a column called link.
sds_faculty$link <- html_attr(html_elements(sds_home, ".linkopacity"), name = "href")
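It is worth a quick sanity check that the three columns line up, one row per faculty member (the exact output depends on the live page, so none is shown here):

# One row per faculty member, with name, title, and link columns
nrow(sds_faculty)
head(sds_faculty)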
You may be thinking “that’s kind of neat, but it doesn’t tell me anything I can’t see with my own eyes.” You’d be right. However, in a typical web scraping process, this is only step one. We now have a column of links to each of the individual faculty pages. If we were to write code that iterates over those links, we could then get more specific info for each faculty member. We could add things like email, office location, educational history, etc. to our dataframe. Once we figure that out, we could also iterate over all of the departments at Smith, and before you know it, we have a full-blown database on our hands.
With this sort of power, you must be very careful. Be sure to build polite bots that obey the website rules, especially with how fast they iterate through pages. Always use the Sys.sleep() function to give your bot a break between each page. The Smith site specifically asks that you wait 10 seconds between each page, so your code should include a Sys.sleep(10) inside each iteration, as in the sketch below.
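As a rough sketch of what that next step might look like (the link column comes from our dataframe above, but everything else here is an assumption: the links may already be absolute URLs, and a real scraper would use SelectorGadget to pick selectors for email, office, and so on):

# Illustrative sketch only, not part of the exercise.
# If the links are relative (e.g. "/academics/..."), prepend the domain first.
faculty_urls <- paste0("https://www.smith.edu", sds_faculty$link)

page_titles <- character(length(faculty_urls))

for (i in seq_along(faculty_urls)) {
  faculty_page <- read_html(faculty_urls[i])

  # Placeholder: grab each page's <title> text; swap in real selectors
  # (found with SelectorGadget) for email, office location, etc.
  page_titles[i] <- html_text2(html_element(faculty_page, "title"))

  # Be polite: Smith asks for 10 seconds between requests
  Sys.sleep(10)
}

sds_faculty$page_title <- page_titles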