Day 11 - Web Scraping

Spring 2023

Smith College

Overview

Timeline

  • What is Web Scraping
  • The Legalities of Web Scraping
  • The Ethics of Web Scraping
  • The Tools & Limits

Goal

Build a bespoke scraper for the SDS website and understand how it works.

What is Web Scraping

A Quick Definition

Web scraping refers to the process of programmatically collecting data from the internet, typically from web pages.


This is often done in a way not intended by website owners.


Web scraping can be useful, but must be used responsibly.

[Diagram: Website → R → Dataset]

A Word of Warning

If you screw this up, you can get the entire university banned from a website.

An Analogy

Web scraping is like going to an event, eating the hors d’oeuvres, and leaving.


The event (website) wants you to stick around, look at some ads, maybe buy something.


If you take your food politely, you’re probably fine. If you’re a jerk, you’re going to cause a scene.

Chris Schaer

A Use Case

Say you want to see where Smith faculty got their degrees: find the most common schools, compare Ivy League vs. not, etc.


You could:

  1. Go to every faculty page on the Smith website and copy/paste the info into a spreadsheet.

OR

  2. Write some code to do all that for you.

The Legalities of Web Scraping

(General ideas, I’m not a lawyer; not legal advice, etc.)

A Primer

The legality of web scraping is a grey area. It depends on several factors, including:

  • The kind of data you are trying to get
  • How you are getting and saving it
  • What you plan to do with the data once you have it

In general, you should never scrape:

  • Anything under copyright
  • Anything about private people
  • Anything you need to log in to see

Terms of Service

Most Terms of Service (ToS) will explicitly state whether or not they allow scraping.


The ToS is often in the footer, or bottom, of the home page.


Look out for terms like:

  • automated
  • bot
  • scrape
  • crawl

Robots.txt

Most sites also have a robots.txt file.


You can find this file by going to the root of a website and adding /robots.txt to the URL, e.g., https://www.smith.edu/robots.txt.


It often has detailed rules on which pages you can scrape, and how often. For example, the Smith website wants bots to pause for 10 seconds between pages, and lists many pages that are disallowed.
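A typical robots.txt looks something like this (a made-up illustration, not Smith's actual file). User-agent says which bots the rules apply to, Crawl-delay asks for a pause (in seconds) between requests, and Disallow lists paths bots should stay away from:

```
# rules for all bots
User-agent: *

# wait 10 seconds between requests
Crawl-delay: 10

# paths bots may not visit
Disallow: /admin/
Disallow: /search/
```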

The Ethics of Web Scraping

The T3 Dataset

The Tastes, Ties, and Time (T3) Dataset contained the following for nearly an entire cohort (grade level) of students, following them every year from 2006 to 2009:

  • Race
  • Gender
  • Political views
  • Home state/country
  • Major
  • Relationships
  • Official housing records

This data was collected directly from the university and Facebook, with no input from the students.

“With permission from Facebook and the university in question, we first accessed Facebook on March 10 and 11, 2006 and downloaded the profile and network data provided by one cohort of college students. This population, the freshman class of 2009 at a diverse private college in the Northeast U.S., has an exceptionally high participation rate on Facebook: of the 1640 freshmen students enrolled at the college, 97.4% maintained Facebook profiles at the time of download and 59.2% of these students had last updated their profile within 5 days. (Lewis et al. 2008, p. 331)”

Serious Concerns

“The ‘non-identifiability’ of such a dataset is up for debate. A friend network can be thought of as a fingerprint; it is likely that no two networks will be exactly similar, meaning individuals may be able to be identified in the dataset post-hoc… Further, the authors of the dataset plan to release student ‘Favorite’ data in 2011, which will provide further information that may lead to identification. (Stutzman 2008)”

“I think it’s hard to imagine that some of this anonymity wouldn’t be breached with some of the participants in the sample. For one thing, some nationalities are only represented by one person. Another issue is that the particular list of majors makes it quite easy to guess which specific school was used to draw the sample. Put those two pieces of information together and I can imagine all sorts of identities becoming rather obvious to at least some people. (Hargittai 2008)”

“The Data was Already Public”

The T3 authors said that:

“We have not accessed any information not otherwise available on Facebook”

The T3 project used student research assistants (some of whom had privileged access to other students’ networks through mutual or direct friendships) to collect all the data.


Given that, was the data really public?


Even if it was, people did not expect their data to be collected and aggregated in this way.

The Tools & Limits

Simple Vs. Complex Pages

Simple

Static pages that display pre-set content.

Complex

Dynamic pages that update given user input or other factors.

A Primer on HTML

HTML (HyperText Markup Language) is the language used to build pretty much everything you see on the web.


Very generally, it creates sections on a web page, and you can apply certain properties to that section. For example, the color, size, and style of text. We can use that section structure to get the data we want.


Fun fact: these slides are all HTML! That’s why you view them in a web browser.

```html
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1>Heading 1</h1>
  <p>Hello world! <b>Bold Hello world!</b></p>
  <a href='jnjoseph.com'>I am a link!</a>
</body>
</html>
```
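To make that concrete, here is a minimal sketch (jumping ahead to the rvest package, introduced below) that parses the example page above and pulls out just the heading section:

```{r}
library(rvest)

# parse the example HTML above from a string
example_page = read_html(
  "<html><body>
     <h1>Heading 1</h1>
     <p>Hello world! <b>Bold Hello world!</b></p>
     <a href='jnjoseph.com'>I am a link!</a>
   </body></html>")

# select the <h1> section and extract its text
html_text2(html_elements(example_page, "h1"))
#> [1] "Heading 1"
```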

SelectorGadget

SelectorGadget is a browser tool that helps us isolate the elements we want from a web page by finding their CSS selectors. In this case, faculty names from the SDS page.

rvest

The rvest package simplifies a lot of basic web scraping.


We can give rvest the CSS selectors from SelectorGadget to easily compile the data we want.


We can use this process to make our own data from information on the web.

Scrape the Data

Here I’ll grab the names for all the faculty on the SDS page.


I first read the entire page into R using read_html() from rvest.


After that, I essentially subset that web page using the CSS selectors I got from SelectorGadget.


Note: If the website ever changes significantly, our scraper code will probably break!

```{r}
library(rvest)

# download the SDS page
sds_page = read_html(
  "https://www.smith.edu/academics/statistics")

# what is it?
class(sds_page)
```
[1] "xml_document" "xml_node"    
```{r}
# Get the names of all SDS faculty
html_text2(
  html_elements(sds_page, ".fac-inset h3")
  )
```
 [1] "Ben Baumer"            "Shiya Cao"             "Kaitlyn Cook"         
 [4] "Rosie Dutt"            "Randi L. Garcia"       "Katherine Halvorsen"  
 [7] "Will Hopper"           "Nicholas Horton"       "Jared Joseph"         
[10] "Albert Young-Sun Kim"  "Katherine M. Kinnaird" "Scott LaCombe"        
[13] "Lindsay Poirier"       "Nutcha Wattanachit"    "Faith Zhang"          

Make a Dataframe

We can repeat this process to create a whole dataframe of information!

```{r}
sds_df = data.frame(
  "name" = html_text2(html_elements(sds_page, ".fac-inset h3")),
  "title" = html_text2(html_elements(sds_page, ".fac-inset p")),
  "rel_link" = html_attr(html_elements(sds_page, ".linkopacity"), name = "href")
)
```
| name | title | rel_link |
|------|-------|----------|
| Ben Baumer | Associate Professor of Statistical & Data Sciences | /academics/faculty/ben-baumer |
| Shiya Cao | MassMutual Assistant Professor of Statistical and Data Sciences | /academics/faculty/shiya-cao |
| Kaitlyn Cook | Assistant Professor of Statistical & Data Sciences | /academics/faculty/kaitlyn-cook |
| Rosie Dutt | Lecturer in Statistical and Data Sciences & Computer Science | /academics/faculty/rosie-dutt |
| Randi L. Garcia | Associate Professor of Psychology and of Statistical & Data Sciences | /academics/faculty/randi-garcia |
| Katherine Halvorsen | Professor Emerita of Mathematics & Statistics | /academics/faculty/katherine-halvorsen |
| Will Hopper | Lecturer in Statistical & Data Sciences | /academics/faculty/will-hopper |
| Nicholas Horton | Research Associate in Statistical & Data Sciences | /academics/faculty/nicholas-horton |
| Jared Joseph | Visiting Assistant Professor of Statistical and Data Sciences | /academics/faculty/jared-joseph |
| Albert Young-Sun Kim | Assistant Professor of Statistical & Data Sciences | /academics/faculty/albert-kim |
| Katherine M. Kinnaird | Clare Boothe Luce Assistant Professor of Computer Science and of Statistical & Data Sciences | /academics/faculty/katherine-kinnaird |
| Scott LaCombe | Assistant Professor of Government and of Statistical & Data Sciences | /academics/faculty/scott-lacombe |
| Lindsay Poirier | Assistant Professor of Statistics & Data Sciences | /academics/faculty/lindsay-poirier |
| Nutcha Wattanachit | UMass Teaching Associate, Statistical and Data Sciences | /academics/faculty/faculty-nutcha-wattanachit |
| Faith Zhang | Lecturer of Statistical and Data Sciences | /academics/faculty/faculty-faith-zhang |

Now Iterate

We can now get data for individual web pages.


We also have a dataframe column with links to all the specific faculty pages.


We could iterate over those links to go to each of the pages and get more information.

This is where the danger is!


If we program a bot to go to more pages, it will do so as fast as possible unless we tell it otherwise.


Good bots take breaks so as to not overload the website. You can do that in R using Sys.sleep(), as in the sketch below.
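Here is a minimal sketch of polite iteration, assuming we want each faculty member's bio. The ".fac-bio" selector is hypothetical; you would find the real one with SelectorGadget.

```{r}
# visit each faculty page, pausing between requests
base_url = "https://www.smith.edu"
fac_bios = c()

for (link in sds_df$rel_link) {
  fac_page = read_html(paste0(base_url, link))
  # ".fac-bio" is a made-up selector; use SelectorGadget to find the real one
  fac_bios = c(fac_bios, html_text2(html_elements(fac_page, ".fac-bio")))
  Sys.sleep(10)  # wait 10 seconds between pages, per the Smith robots.txt
}
```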

Code-Along (ish)

For Next Time

Topic

RegEx

To-Do

  • Finish Worksheet