Lab 4. Web Scraping & RegEx

Author

Dr. Jared Joseph

Introduction

Click here to access the lab on Github Classroom: Github Classroom Assignment for Lab 4: Web Scraping & RegEx

Web scraping gives you the ability to create your own datasets from online resources. While getting information from a single page (like our worksheet from Monday) can be helpful, it is really only the first step of building a proper dataset. Today we will continue our work scraping the Smith website, with the goal of practicing a full web scraping workflow.

The Target

The first target of our scraping will again be the Smith Statistical & Data Sciences webpage. However, instead of stopping with the information on the home page, we will teach our scraper to “crawl” through the site and visit multiple pages to build our dataset.

We will eventually want to visit each faculty page (here is mine, for example), and combine the information on that page (email, office hours, etc.) with what we already have from the home page (name, title, links). Once we have code to do that, we can expand from one program to several.

Homepage Code

Let’s recap the code we wrote on Monday. I’ve provided a working copy below. It first reads the program homepage into R, makes a dataframe with a row for each faculty member, then adds columns for their titles and a relative link to their personal pages.

library(rvest)

# get the home page into R
sds_page = read_html('https://www.smith.edu/academics/statistics')

# make a dataframe with all the names
sds_faculty = data.frame('name' = html_text2(html_elements(sds_page, '.fac-inset h3')))

# add titles to the dataframe
sds_faculty$title = html_text2(html_elements(sds_page, '.fac-inset p'))

# get the relative links to each faculty page
sds_faculty$link = html_attr(html_elements(sds_page, '.linkopacity'), name = 'href')

| name | title | link |
|------|-------|------|
| Ben Baumer | Associate Professor of Statistical & Data Sciences | /academics/faculty/ben-baumer |
| Shiya Cao | MassMutual Assistant Professor of Statistical and Data Sciences | /academics/faculty/shiya-cao |
| Kaitlyn Cook | Assistant Professor of Statistical & Data Sciences | /academics/faculty/kaitlyn-cook |
| Rosie Dutt | Lecturer in Statistical and Data Sciences & Computer Science | /academics/faculty/rosie-dutt |
| Randi L. Garcia | Associate Professor of Psychology and of Statistical & Data Sciences | /academics/faculty/randi-garcia |
| Katherine Halvorsen | Professor Emerita of Mathematics & Statistics | /academics/faculty/katherine-halvorsen |
| Will Hopper | Lecturer in Statistical & Data Sciences | /academics/faculty/will-hopper |
| Nicholas Horton | Research Associate in Statistical & Data Sciences | /academics/faculty/nicholas-horton |
| Jared Joseph | Visiting Assistant Professor of Statistical and Data Sciences | /academics/faculty/jared-joseph |
| Albert Young-Sun Kim | Assistant Professor of Statistical & Data Sciences | /academics/faculty/albert-kim |
| Katherine M. Kinnaird | Clare Boothe Luce Assistant Professor of Computer Science and of Statistical & Data Sciences | /academics/faculty/katherine-kinnaird |
| Scott LaCombe | Assistant Professor of Government and of Statistical & Data Sciences | /academics/faculty/scott-lacombe |
| Lindsay Poirier | Assistant Professor of Statistics & Data Sciences | /academics/faculty/lindsay-poirier |
| Nutcha Wattanachit | UMass Teaching Associate, Statistical and Data Sciences | /academics/faculty/faculty-nutcha-wattanachit |
| Faith Zhang | Lecturer of Statistical and Data Sciences | /academics/faculty/faculty-faith-zhang |
Question 1

Take the code from Monday’s worksheet and turn it into a function called scrape_homepage. This function should accept an argument called url, which will be the URL of the program homepage we want to scrape. It should return the sds_faculty dataframe shown above.

Then, use your new function to scrape the SDS homepage into an object called sds_faculty.

# <REPLACE THIS COMMENT WITH YOUR ANSWER>
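
If you get stuck, here is one possible shape: a minimal sketch that simply wraps the working homepage code from above in a function. Nothing here is new beyond the function wrapper itself.

# a minimal sketch: wrap the working homepage code in a function
scrape_homepage = function(url) {
  # read the program homepage into R
  page = read_html(url)

  # build a dataframe of names, then add titles and relative links
  faculty = data.frame('name' = html_text2(html_elements(page, '.fac-inset h3')))
  faculty$title = html_text2(html_elements(page, '.fac-inset p'))
  faculty$link = html_attr(html_elements(page, '.linkopacity'), name = 'href')

  return(faculty)
}

# use the function to scrape the SDS homepage
sds_faculty = scrape_homepage('https://www.smith.edu/academics/statistics')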

Digging Deeper

Now that we have the homepage data again, we’re going to dig a little deeper and follow those faculty links. We want to scrape each faculty member’s individual page to get their email address and office hours info.

Question 2

Write code to iterate through the link column of sds_faculty and scrape email addresses and office hours info. We will incorporate it into our function in the next step.

USE THE TEMPLATE BELOW TO MAKE SURE YOUR BOTS ARE POLITE AND WAIT BETWEEN EACH PAGE.

The Sys.sleep(10) call will make R wait 10 seconds between each page, just like the https://www.smith.edu/robots.txt file asks us to. It also means the code will take roughly 10 × (number of faculty) seconds to run, so don’t be surprised if it takes a while.

Tip

When getting your selector targets from the faculty pages, you will probably have to try multiple versions before your code works for every faculty member. Try selecting the same info on several pages to find a selector that works across all of them. There’s no way around it: web scraping is just messy.
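
One pattern that can help here: CSS lets you list several selectors separated by commas, and an element matches if any one of them does. The class names below are illustrative guesses, not the real Smith selectors, so treat this as a pattern rather than an answer.

# illustrative sketch: the class names here are guesses, not real Smith selectors
# a comma-separated CSS selector matches if ANY of the listed patterns hit
# faculty_page stands for one faculty page you have already read with read_html()
email_node = html_element(faculty_page, '.contact-email a, a[href^="mailto:"]')
email = html_attr(email_node, 'href')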

# <REPLACE THIS COMMENT WITH YOUR ANSWER>

# Template

#for(x in y){
#  
#  # your code here
#  
#  # wait the 10 seconds requested by robots.txt
#  Sys.sleep(10)
#  
#}
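
Filled in, the template might look something like the sketch below. The two selectors ('.fac-email a' and '.office-hours') are placeholders you will need to swap for ones that actually work across faculty pages, and the relative links need the site root pasted on the front.

# a sketch of the crawl; the selectors are placeholders, not working answers
emails = character(nrow(sds_faculty))
office_hours = character(nrow(sds_faculty))

for (i in seq_len(nrow(sds_faculty))) {

  # the link column is relative, so paste the site root on the front
  faculty_page = read_html(paste0('https://www.smith.edu', sds_faculty$link[i]))

  # swap these placeholder selectors for ones that work on the real pages
  emails[i] = html_text2(html_element(faculty_page, '.fac-email a'))
  office_hours[i] = html_text2(html_element(faculty_page, '.office-hours'))

  # wait the 10 seconds requested by robots.txt
  Sys.sleep(10)
}

sds_faculty$email = emails
sds_faculty$office_hours = office_hours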

Tying it Together

Now that we have code to go to each individual faculty page, let’s package that up nicely as well.

Question 3

Incorporate the code from the previous section into your scrape_homepage function so that you can give the function a program homepage URL and it will return a dataframe called dept_output with faculty names, titles, page URLs, emails, and office hours information (whatever they have) for all faculty in that program. Make this expanded information toggleable with an argument to the function.

# <REPLACE THIS COMMENT WITH YOUR ANSWER>
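
One possible structure, sketched under the assumption that faculty_info is a logical argument defaulting to FALSE (matching the call in the starter code further down) and that you found working selectors in Question 2; the selectors shown are still placeholders.

# sketch: homepage scrape plus an optional, slower crawl of each faculty page
scrape_homepage = function(url, faculty_info = FALSE) {

  page = read_html(url)

  # same homepage scrape as before
  dept_output = data.frame('name' = html_text2(html_elements(page, '.fac-inset h3')))
  dept_output$title = html_text2(html_elements(page, '.fac-inset p'))
  dept_output$link = html_attr(html_elements(page, '.linkopacity'), name = 'href')

  # only crawl the individual pages when asked, since each page costs 10 seconds
  if (faculty_info) {
    dept_output$email = NA_character_
    dept_output$office_hours = NA_character_

    for (i in seq_len(nrow(dept_output))) {
      faculty_page = read_html(paste0('https://www.smith.edu', dept_output$link[i]))

      # placeholder selectors, as in the Question 2 sketch
      dept_output$email[i] = html_text2(html_element(faculty_page, '.fac-email a'))
      dept_output$office_hours[i] = html_text2(html_element(faculty_page, '.office-hours'))

      # wait the 10 seconds requested by robots.txt
      Sys.sleep(10)
    }
  }

  return(dept_output)
}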

Once you have your function working, try pointing it at another department homepage!
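
For example (the department URL below is a guess following the same URL pattern, so swap in a real one):

# hypothetical example: the government department URL is a guess, not verified
gov_faculty = scrape_homepage('https://www.smith.edu/academics/government', faculty_info = TRUE)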

Enriching Our Data

We now have a function that can take a department’s homepage URL and get info on all of its faculty. Something else that would be nice to know is what classes each faculty member is teaching this year. The problem is that there is no clean section of the page with nice selectors for us to scrape; we’ll need to take all of the classes text and use RegEx to clean it.

Look on the SDS page and click the drop-down for “Current Offerings.” This section lists who is teaching which classes this year. We’ll need to figure out how to clean this data and add it to our output. We are not going to add this into our function, because the function currently works on any Smith department website, while what we do here is specific to SDS; not every department lists their current offerings in the same way.

I’ve provided some code to get you started below.

Challenge Question

Use RegEx to clean the course offerings for this year, and add the results to our dept_output from above. You do not need to do this within the function (and shouldn’t, as that would make the function less flexible; it would only work for SDS).

Your final result should be a new column in dept_output which lists all the classes each faculty member is teaching this year.

# load in stringr for regex text tools
library(stringr)

# get dept_output if you have not already
dept_output = scrape_homepage("https://www.smith.edu/academics/statistics", faculty_info = TRUE)

# get the full page data
program_page = read_html('https://www.smith.edu/academics/statistics')

# grab the class blob from the program page
class_blob = html_text2(html_elements(program_page, '#academics-statistical-data-sciences-current-offerings .panel-body'))

# <REPLACE THIS COMMENT WITH YOUR ANSWER>
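
One hedged way to start: SDS course codes look like “SDS” followed by a three-digit number, so you can keep only the lines of class_blob that contain one, then match each faculty member’s last name against those lines. This sketch assumes one course listing per line and that instructors are listed by name; expect to adjust the patterns once you see the real text.

# split the blob into lines, assuming one course listing per line
class_lines = unlist(str_split(class_blob, '\n'))

# keep only lines that mention a course code like 'SDS 192'
class_lines = str_subset(class_lines, 'SDS ?\\d{3}')

# pull the course code out of each line
course_codes = str_extract(class_lines, 'SDS ?\\d{3}')

# for each faculty member, collect the codes from lines mentioning their last name
dept_output$classes = sapply(dept_output$name, function(fac_name) {
  # use the last name only, since listings often abbreviate first names
  last_name = str_extract(fac_name, '\\S+$')
  paste(course_codes[str_detect(class_lines, fixed(last_name))], collapse = '; ')
})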