Lab 4. Web Scraping & RegEx
Introduction
Click here to access the lab on Github Classroom: Github Classroom Assignment for Lab 4: Web Scraping & RegEx
Web scraping gives you the ability to create your own datasets from online resources. While getting information from one page (like our worksheet from Monday) can be helpful, it is really only the first step of building a proper dataset. Today we will continue our work scraping the Smith website, with the goal of practicing a full web scraping workflow.
The Target
The first target of our scraping will again be the Smith Statistical & Data Sciences webpage. However, instead of stopping with the information on the home page, we will teach our scraper to “crawl” through the site and visit multiple pages to build our dataset.
We will eventually want to visit each faculty page (here is mine, for example), and combine the information on that page (email, office hours, etc.) with what we already have from the home page (name, title, links). Once we have code to do that, we can expand from one program to several.
Homepage Code
Let’s recap the code we wrote on Monday. I’ve provided a copy of code that works below. It first reads the program homepage into R, makes a dataframe with a row for each faculty member, then adds columns for their titles and the relative link to their personal pages.
library(rvest)

# get the home page into R
sds_page = read_html('https://www.smith.edu/academics/statistics')

# make a dataframe with all the names
sds_faculty = data.frame('name' = html_text2(html_elements(sds_page, '.fac-inset h3')))

# add titles to the dataframe
sds_faculty$title = html_text2(html_elements(sds_page, '.fac-inset p'))

# get the relative links to each faculty page
sds_faculty$link = html_attr(html_elements(sds_page, '.linkopacity'), name = 'href')
name | title | link |
---|---|---|
Ben Baumer | Associate Professor of Statistical & Data Sciences | /academics/faculty/ben-baumer |
Shiya Cao | MassMutual Assistant Professor of Statistical and Data Sciences | /academics/faculty/shiya-cao |
Kaitlyn Cook | Assistant Professor of Statistical & Data Sciences | /academics/faculty/kaitlyn-cook |
Rosie Dutt | Lecturer in Statistical and Data Sciences & Computer Science | /academics/faculty/rosie-dutt |
Randi L. Garcia | Associate Professor of Psychology and of Statistical & Data Sciences | /academics/faculty/randi-garcia |
Katherine Halvorsen | Professor Emerita of Mathematics & Statistics | /academics/faculty/katherine-halvorsen |
Will Hopper | Lecturer in Statistical & Data Sciences | /academics/faculty/will-hopper |
Nicholas Horton | Research Associate in Statistical & Data Sciences | /academics/faculty/nicholas-horton |
Jared Joseph | Visiting Assistant Professor of Statistical and Data Sciences | /academics/faculty/jared-joseph |
Albert Young-Sun Kim | Assistant Professor of Statistical & Data Sciences | /academics/faculty/albert-kim |
Katherine M. Kinnaird | Clare Boothe Luce Assistant Professor of Computer Science and of Statistical & Data Sciences | /academics/faculty/katherine-kinnaird |
Scott LaCombe | Assistant Professor of Government and of Statistical & Data Sciences | /academics/faculty/scott-lacombe |
Lindsay Poirier | Assistant Professor of Statistics & Data Sciences | /academics/faculty/lindsay-poirier |
Nutcha Wattanachit | UMass Teaching Associate, Statistical and Data Sciences | /academics/faculty/faculty-nutcha-wattanachit |
Faith Zhang | Lecturer of Statistical and Data Sciences | /academics/faculty/faculty-faith-zhang |
# <REPLACE THIS COMMENT WITH YOUR ANSWER>
Digging Deeper
Now that we have the homepage data again, we’re going to dig a little deeper and follow those faculty links. We want to scrape each faculty member’s individual page to get their email and office hours info.
# <REPLACE THIS COMMENT WITH YOUR ANSWER>
# Template
# for (x in y) {
#
#   # your code here
#
#   # wait the 10 seconds requested by robots.txt
#   Sys.sleep(10)
#
# }
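One way the template might be filled in is sketched below. It assumes the sds_faculty dataframe from the homepage code, and note that the '.fac-email' and '.fac-hours' selectors are invented placeholders: inspect an actual faculty page in your browser to find the real ones before running this.

```r
library(rvest)

# assumes sds_faculty exists, with a 'link' column of relative
# paths like '/academics/faculty/will-hopper'
sds_faculty$email = NA
sds_faculty$office_hours = NA

for (i in seq_len(nrow(sds_faculty))) {

  # build the absolute URL from the relative link
  page = read_html(paste0('https://www.smith.edu', sds_faculty$link[i]))

  # PLACEHOLDER selectors: replace these with the real ones
  sds_faculty$email[i] = html_text2(html_element(page, '.fac-email'))
  sds_faculty$office_hours[i] = html_text2(html_element(page, '.fac-hours'))

  # wait the 10 seconds requested by robots.txt
  Sys.sleep(10)
}
```

Note the singular html_element() here rather than html_elements(): each page should have at most one email, and the singular version returns a missing value (rather than an empty vector) when nothing matches, which keeps the column assignment from erroring.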
Tying it Together
Now that we have code to go to each individual faculty page, let’s package that up nicely as well.
# <REPLACE THIS COMMENT WITH YOUR ANSWER>
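A sketch of what such a function could look like is below. The name scrape_homepage and the faculty_info argument match how the function is called later in this lab; the '.fac-email' selector used on the individual faculty pages is a placeholder you would need to verify yourself.

```r
library(rvest)

scrape_homepage = function(url, faculty_info = FALSE) {

  homepage = read_html(url)

  # one row per faculty member, as in the homepage code
  faculty = data.frame('name' = html_text2(html_elements(homepage, '.fac-inset h3')))
  faculty$title = html_text2(html_elements(homepage, '.fac-inset p'))
  faculty$link = html_attr(html_elements(homepage, '.linkopacity'), name = 'href')

  # optionally crawl each individual faculty page for extra info
  if (faculty_info) {
    faculty$email = NA
    for (i in seq_len(nrow(faculty))) {
      page = read_html(paste0('https://www.smith.edu', faculty$link[i]))
      # PLACEHOLDER selector: replace with the real one
      faculty$email[i] = html_text2(html_element(page, '.fac-email'))
      # wait the 10 seconds requested by robots.txt
      Sys.sleep(10)
    }
  }

  return(faculty)
}

# example call:
# dept_output = scrape_homepage('https://www.smith.edu/academics/statistics', faculty_info = TRUE)
```

Keeping faculty_info as an argument means the quick homepage-only scrape stays cheap, and the slow page-by-page crawl only happens when you ask for it.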
Once you have your function working, try pointing it at another department homepage!
Enriching Our Data
We now have a function that can take in a department’s homepage URL and get info on all of its faculty. Something else that may be nice to know is which classes each faculty member is teaching this year. The problem is that there is no clean section of the page with nice selectors for us to scrape; we’ll need to take all the class text and use RegEx to clean it.
Look on the SDS page and click the drop-down for “Current Offerings.” This section contains who is teaching which classes this year. We’ll need to figure out how to clean this data and add it to our output. We are not going to add this to our function, because while the function currently works on any Smith department website, what we do here is specific to SDS: not every department lists their current offerings in the same way.
I’ve provided some code to get you started below.
# load in stringr for regex text tools
library(stringr)

# get dept_output if you have not
dept_output = scrape_homepage("https://www.smith.edu/academics/statistics", faculty_info = TRUE)

# get the full page data
program_page = read_html('https://www.smith.edu/academics/statistics')

# grab the class blob from the program page
class_blob = html_text2(html_elements(program_page, '#academics-statistical-data-sciences-current-offerings .panel-body'))
# <REPLACE THIS COMMENT WITH YOUR ANSWER>
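As a starting point for the cleaning itself, here is a small illustration of the kind of pattern matching stringr makes possible. The sample_blob string is made-up stand-in text, and the pattern simply assumes course codes look like “SDS 192”; you will need to adapt both to the real class_blob.

```r
library(stringr)

# made-up stand-in for the real class_blob text
sample_blob = "SDS 192 Introduction to Data Science\nWill Hopper\nSDS 291 Multiple Regression\nKaitlyn Cook"

# extract anything that looks like a course code: 2-4 capital
# letters, a space, then three digits
str_extract_all(sample_blob, '[A-Z]{2,4} \\d{3}')[[1]]
# returns "SDS 192" "SDS 291"
```

str_extract_all() returns a list with one character vector per input string, which is why the [[1]] is needed to get the matches themselves.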