Lab 4. Web Scraping & RegEx
Introduction
Click here to access the lab on Github Classroom: Github Classroom Assignment for Lab 4: Web Scraping & RegEx
Web scraping can give you the ability to create your own datasets from online resources. While getting information from one page (like our worksheet from Monday) can be helpful, it is really only the first step of building a proper dataset. Today we will continue our work scraping the Smith website, with the goal of practicing a full web scraping workflow.
The Target
The first target of our scraping will again be the Smith Statistical & Data Sciences webpage. However, instead of stopping with the information on the home page, we will teach our scraper to “crawl” through the site and visit multiple pages to build our dataset.
We will eventually want to visit each faculty page (here is mine, for example), and combine the information on that page (email, office hours, etc.) with what we already have from the home page (name, title, links). Once we have code to do that, we can expand from one program to several.
Homepage Code
Let’s recap the code we wrote on Monday. I’ve provided a copy of working code below. It first reads the program homepage into R, makes a dataframe with a row for each faculty member, then adds columns for their titles and the relative links to their personal pages.
library(rvest)
# get the home page into R
sds_page = read_html('https://www.smith.edu/academics/statistics')
# make a dataframe with all the names
sds_faculty = data.frame('name' = html_text2(html_elements(sds_page, '.fac-inset h3')))
# add titles to the dataframe
sds_faculty$title = html_text2(html_elements(sds_page, '.fac-inset p'))
# get the relative links to each faculty page
sds_faculty$link = html_attr(html_elements(sds_page, '.linkopacity'), name = 'href')
name | title | link |
---|---|---|
Ben Baumer | Associate Professor of Statistical & Data Sciences | /academics/faculty/ben-baumer |
Shiya Cao | MassMutual Assistant Professor of Statistical and Data Sciences | /academics/faculty/shiya-cao |
Kaitlyn Cook | Assistant Professor of Statistical & Data Sciences | /academics/faculty/kaitlyn-cook |
Rosie Dutt | Lecturer in Statistical and Data Sciences & Computer Science | /academics/faculty/rosie-dutt |
Randi L. Garcia | Associate Professor of Psychology and of Statistical & Data Sciences | /academics/faculty/randi-garcia |
Katherine Halvorsen | Professor Emerita of Mathematics & Statistics | /academics/faculty/katherine-halvorsen |
Will Hopper | Lecturer in Statistical & Data Sciences | /academics/faculty/will-hopper |
Nicholas Horton | Research Associate in Statistical & Data Sciences | /academics/faculty/nicholas-horton |
Jared Joseph | Visiting Assistant Professor of Statistical and Data Sciences | /academics/faculty/jared-joseph |
Albert Young-Sun Kim | Assistant Professor of Statistical & Data Sciences | /academics/faculty/albert-kim |
Katherine M. Kinnaird | Clare Boothe Luce Assistant Professor of Computer Science and of Statistical & Data Sciences | /academics/faculty/katherine-kinnaird |
Scott LaCombe | Assistant Professor of Government and of Statistical & Data Sciences | /academics/faculty/scott-lacombe |
Lindsay Poirier | Assistant Professor of Statistics & Data Sciences | /academics/faculty/lindsay-poirier |
Nutcha Wattanachit | UMass Teaching Associate, Statistical and Data Sciences | /academics/faculty/faculty-nutcha-wattanachit |
Faith Zhang | Lecturer of Statistical and Data Sciences | /academics/faculty/faculty-faith-zhang |
Take the code from Wednesday’s worksheet, and turn it into a function called `scrape_homepage`. This function should accept an argument called `url`, which will be the URL of the program homepage we want to scrape. It should output the `sds_faculty` dataframe as above.
Then, use your new function to scrape the SDS homepage into an object called `sds_faculty`.
# <REPLACE THIS COMMENT WITH YOUR ANSWER>
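For reference, here is one minimal way the function could look: a sketch that simply wraps the recap code from above, assuming those selectors still match the page.
# a possible scrape_homepage(), wrapping the recap code above
scrape_homepage = function(url){
  # read the homepage into R
  page = read_html(url)
  # one row per faculty member, then title and relative-link columns
  faculty = data.frame('name' = html_text2(html_elements(page, '.fac-inset h3')))
  faculty$title = html_text2(html_elements(page, '.fac-inset p'))
  faculty$link = html_attr(html_elements(page, '.linkopacity'), name = 'href')
  return(faculty)
}

# scrape the SDS homepage into sds_faculty
sds_faculty = scrape_homepage('https://www.smith.edu/academics/statistics')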
Digging Deeper
Now that we have the homepage data again, we’re going to dig a little deeper and follow those faculty links. We want to scrape each individual faculty member’s page to get their email and office hours info.
Write code to iterate through the `link` column of `sds_faculty` and scrape email addresses and office hours info. We will incorporate it into our function in the next step.
USE THE TEMPLATE BELOW TO MAKE SURE YOUR BOTS ARE POLITE AND WAIT BETWEEN EACH PAGE.
The `Sys.sleep(10)` function will make R wait 10 seconds between each page, just as the https://www.smith.edu/robots.txt file asks us to. It also means the code will take 10 seconds times the number of faculty to run, so don’t be surprised if it takes a while.
When getting your selector targets from faculty pages, you will probably have to try multiple versions before your code works for all faculty. Try selecting the same info on multiple pages to find selectors that work for all of them. There’s no way around it: web scraping is just messy.
# <REPLACE THIS COMMENT WITH YOUR ANSWER>
# Template
#for(x in y){
#
# # your code here
#
# # wait the 10 seconds requested by robots.txt
# Sys.sleep(10)
#
#}
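Filled in, the template might look something like the sketch below. The selectors here ('.fac-email' and '.fac-hours') are placeholders, not the real ones; you will need to inspect several faculty pages to find selectors that work for everyone.
base_url = 'https://www.smith.edu'
# empty columns to fill in as we go
sds_faculty$email = NA_character_
sds_faculty$office_hours = NA_character_

for(i in seq_len(nrow(sds_faculty))){
  # the links are relative, so paste them onto the site root
  faculty_page = read_html(paste0(base_url, sds_faculty$link[i]))
  # placeholder selectors -- replace with ones that work on the real pages
  sds_faculty$email[i] = html_text2(html_element(faculty_page, '.fac-email'))
  sds_faculty$office_hours[i] = html_text2(html_element(faculty_page, '.fac-hours'))
  # wait the 10 seconds requested by robots.txt
  Sys.sleep(10)
}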
Tying it Together
Now that we have code to go to each individual faculty page, let’s package that up nicely as well.
Incorporate the code from the previous section into our `scrape_homepage` function such that you can give the function a program homepage URL, and it will return a dataframe called `dept_output` with faculty names, titles, page URLs, emails, and office hours information (whatever they have) for all faculty in that program. Make this expanded information toggleable with an argument to our function.
# <REPLACE THIS COMMENT WITH YOUR ANSWER>
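One possible shape for the expanded function is sketched below. The `faculty_info` argument name matches the starter code later in this lab; the per-page selectors are still placeholders you would swap for the ones you found above.
scrape_homepage = function(url, faculty_info = FALSE){
  page = read_html(url)
  dept_output = data.frame('name' = html_text2(html_elements(page, '.fac-inset h3')))
  dept_output$title = html_text2(html_elements(page, '.fac-inset p'))
  dept_output$link = html_attr(html_elements(page, '.linkopacity'), name = 'href')
  # only crawl the individual faculty pages when asked to
  if(faculty_info){
    dept_output$email = NA_character_
    dept_output$office_hours = NA_character_
    for(i in seq_len(nrow(dept_output))){
      faculty_page = read_html(paste0('https://www.smith.edu', dept_output$link[i]))
      # placeholder selectors -- swap in the ones you found above
      dept_output$email[i] = html_text2(html_element(faculty_page, '.fac-email'))
      dept_output$office_hours[i] = html_text2(html_element(faculty_page, '.fac-hours'))
      # wait the 10 seconds requested by robots.txt
      Sys.sleep(10)
    }
  }
  return(dept_output)
}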
Once you have your function working, try pointing it at another department homepage!
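For example (the URL below is a guess at another program’s homepage, so verify it first, and expect to adjust selectors if the page is laid out differently):
# hypothetical department URL -- check that it exists before running
gov_output = scrape_homepage('https://www.smith.edu/academics/government', faculty_info = TRUE)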
Enriching Our Data
We now have a function that can take in a department’s homepage URL and get info on all the faculty. Something else that may be nice to know is what classes each faculty member is teaching this year. The problem is that there is no clean section of the page with nice selectors for us to scrape; we’ll need to take all the classes text and use RegEx to clean it.
Look on the SDS page and click the drop down for “Current Offerings.” This section lists who is teaching which classes this year. We’ll need to figure out how to clean this data and add it to our output. We are not going to add this into our function: the function currently works on any Smith department website, while what we do here is specific to SDS, since not every department lists their current offerings in the same way.
I’ve provided some code to get you started below.
Use RegEx to clean the course offerings this year, and add the results to our `dept_output` from above. You do not need to do this within the function (and shouldn’t, as that would make our function less flexible, since it would only work for SDS).
Your final result should be a new column in `dept_output` which lists all the classes that faculty member is teaching this year.
# load in stringr for regex text tools
library(stringr)
# get dept_output if you have not
dept_output = scrape_homepage("https://www.smith.edu/academics/statistics", faculty_info = TRUE)
# get the full page data
program_page = read_html('https://www.smith.edu/academics/statistics')
# grab the class blob from the program page
class_blob = html_text2(html_elements(program_page, '#academics-statistical-data-sciences-current-offerings .panel-body'))
# <REPLACE THIS COMMENT WITH YOUR ANSWER>
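As a starting point, here is one hedged sketch. It assumes each offering in `class_blob` sits on its own line in a made-up “SDS 123 Course Title (Instructor Name)” format; print `class_blob` first and adapt the patterns to what you actually see.
# split the blob into one line per offering (assumed layout -- check with cat(class_blob))
class_lines = unlist(str_split(class_blob, '\n'))
# hypothetical patterns: course code and title before the parenthesis, instructor inside it
courses = str_trim(str_extract(class_lines, '^[A-Z]{3} \\d{3}[^(]*'))
instructors = str_extract(class_lines, '(?<=\\()[^)]+(?=\\))')
# for each faculty member, collect the courses listed under their name
dept_output$classes = sapply(dept_output$name, function(fac_name){
  paste(courses[which(instructors == fac_name)], collapse = '; ')
})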