```{mermaid}
graph TD
  A[Website] --> B{R}
  B --> C[Dataset]
  linkStyle 0 stroke:white
  linkStyle 1 stroke:white
```
Spring 2023
Smith College
Build a bespoke scraper for the SDS website and understand how it works.
Web scraping refers to the process of programmatically collecting data from the internet, typically from web pages.
This is often done in a way not intended by website owners.
Web scraping can be useful, but must be used responsibly.
If you screw this up, you can get the entire university banned from a website.
Web scraping is like going to an event, eating the hors d’oeuvres, and leaving.
The event (website) wants you to stick around, look at some ads, maybe buy something.
If you take your food politely, you’re probably fine. If you’re a jerk, you’re going to cause a scene.
Say you want to see where Smith faculty got their degrees: find the most common schools, compare Ivy League versus not, and so on.
You could go to each faculty page and copy the information down by hand…

OR

…you could write a scraper to collect it for you.
(General ideas, I’m not a lawyer; not legal advice, etc.)
The legality of web scraping is a grey area. It depends on several factors, including what data you collect, how you collect it, and what you do with it afterward.
In general, you should never scrape:

- data behind a login or paywall
- personal or sensitive information
- sites whose Terms of Service forbid it
Most Terms of Service (ToS) will explicitly state whether they allow scraping.
The ToS is often in the footer, or bottom, of the home page.
Look out for terms like “scrape”, “crawl”, “spider”, “bot”, and “automated access”.
Most sites also have a robots.txt file.
You can find this file by adding /robots.txt to the end of a site’s root URL (e.g., https://www.smith.edu/robots.txt).
It often has detailed rules on what pages you can scrape, and how often. For example, the Smith website wants bots to pause for 10 seconds between pages, and lists many that are disallowed.
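You can even peek at a robots.txt file straight from R. A minimal sketch using base R (the exact contents of the file will vary):

```{r}
# read Smith's robots.txt and show the first few rules
robots <- readLines("https://www.smith.edu/robots.txt")
head(robots, 10)
```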
The Tastes, Ties, and Time (T3) dataset contained the following for nearly an entire cohort (grade level) of students, following them every year from 2006 to 2009: cultural tastes (favorite books, music, and movies), social ties (Facebook friend networks), and demographic profile information.
This data was collected directly from the university and Facebook, with no input from the students.
“With permission from Facebook and the university in question, we first accessed Facebook on March 10 and 11, 2006 and downloaded the profile and network data provided by one cohort of college students. This population, the freshman class of 2009 at a diverse private college in the Northeast U.S., has an exceptionally high participation rate on Facebook: of the 1640 freshmen students enrolled at the college, 97.4% maintained Facebook profiles at the time of download and 59.2% of these students had last updated their profile within 5 days. (Lewis et al. 2008, p. 331)”
“The ‘non-identifiability’ of such a dataset is up for debate. A friend network can be thought of as a fingerprint; it is likely that no two networks will be exactly similar, meaning individuals may be able to be identified in the dataset post-hoc… Further, the authors of the dataset plan to release student ‘Favorite’ data in 2011, which will provide further information that may lead to identification. (Stutzman 2008)”
“I think it’s hard to imagine that some of this anonymity wouldn’t be breached with some of the participants in the sample. For one thing, some nationalities are only represented by one person. Another issue is that the particular list of majors makes it quite easy to guess which specific school was used to draw the sample. Put those two pieces of information together and I can imagine all sorts of identities becoming rather obvious to at least some people. (Hargittai 2008)”
The T3 authors said that:
“We have not accessed any information not otherwise available on Facebook”
The T3 project used student research assistants (some of whom had privileged access to other students’ networks through mutual or direct friendships) to collect all the data.
Given that, was the data really public?
Even if it was, people did not expect their data to be collected and aggregated in this way.
There are two broad kinds of web pages:

- Static pages that display pre-set content.
- Dynamic pages that update given user input or other factors.
HTML (HyperText Markup Language) is the language used to build pretty much everything you see on the web.
Very generally, it divides a web page into sections, and you can apply properties to each section: for example, the color, size, and style of text. We can use that section structure to get the data we want.
Fun fact: these slides are all HTML! That’s why you view them in a web browser.
```html
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>Heading 1</h1>
    <p>Hello world! <b>Bold Hello world!</b></p>
    <a href='jnjoseph.com'>I am a link!</a>
  </body>
</html>
```
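To preview how R can pull data out of HTML sections like these, here is a small sketch using rvest’s minimal_html() helper; rvest itself is introduced below:

```{r}
library(rvest)

# parse the example HTML above and extract specific elements
page <- minimal_html("
  <h1>Heading 1</h1>
  <p>Hello world! <b>Bold Hello world!</b></p>
  <a href='jnjoseph.com'>I am a link!</a>
")

html_text2(html_element(page, "h1"))       # the heading text
html_attr(html_element(page, "a"), "href") # the link target
```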
SelectorGadget helps us isolate those elements we want from a web page. In this case, faculty names from the SDS page.
The rvest package simplifies a lot of basic web scraping. We can give rvest the CSS selectors from SelectorGadget to easily compile the data we want.
We can use this process to make our own data from information on the web.
Here I’ll grab the names for all the faculty on the SDS page. I first read the entire page into R using read_html() from rvest. After that, I essentially subset that web page using the HTML sections I got from SelectorGadget.
Note: If the website ever changes significantly, our scraper code will probably break!
```{r}
library(rvest)

# download the SDS page
sds_page <- read_html("https://www.smith.edu/academics/statistics")

# what is it?
class(sds_page)
```
[1] "xml_document" "xml_node"
```{r}
# Get the names of all SDS faculty
html_text2(
html_elements(sds_page, ".fac-inset h3")
)
```
[1] "Ben Baumer" "Shiya Cao" "Kaitlyn Cook"
[4] "Rosie Dutt" "Randi L. Garcia" "Katherine Halvorsen"
[7] "Will Hopper" "Nicholas Horton" "Jared Joseph"
[10] "Albert Young-Sun Kim" "Katherine M. Kinnaird" "Scott LaCombe"
[13] "Lindsay Poirier" "Nutcha Wattanachit" "Faith Zhang"
We can repeat this process to create a whole dataframe of information!
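A sketch of how such a dataframe could be assembled; only the .fac-inset h3 selector comes from above, and the title and link selectors here are hypothetical stand-ins you would find the same way with SelectorGadget:

```{r}
# build a faculty dataframe from three selectors
faculty <- data.frame(
  name     = html_text2(html_elements(sds_page, ".fac-inset h3")),
  title    = html_text2(html_elements(sds_page, ".fac-inset p")),       # assumed selector
  rel_link = html_attr(html_elements(sds_page, ".fac-inset a"), "href") # assumed selector
)

head(faculty)
```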
name | title | rel_link |
---|---|---|
Ben Baumer | Associate Professor of Statistical & Data Sciences | /academics/faculty/ben-baumer |
Shiya Cao | MassMutual Assistant Professor of Statistical and Data Sciences | /academics/faculty/shiya-cao |
Kaitlyn Cook | Assistant Professor of Statistical & Data Sciences | /academics/faculty/kaitlyn-cook |
Rosie Dutt | Lecturer in Statistical and Data Sciences & Computer Science | /academics/faculty/rosie-dutt |
Randi L. Garcia | Associate Professor of Psychology and of Statistical & Data Sciences | /academics/faculty/randi-garcia |
Katherine Halvorsen | Professor Emerita of Mathematics & Statistics | /academics/faculty/katherine-halvorsen |
Will Hopper | Lecturer in Statistical & Data Sciences | /academics/faculty/will-hopper |
Nicholas Horton | Research Associate in Statistical & Data Sciences | /academics/faculty/nicholas-horton |
Jared Joseph | Visiting Assistant Professor of Statistical and Data Sciences | /academics/faculty/jared-joseph |
Albert Young-Sun Kim | Assistant Professor of Statistical & Data Sciences | /academics/faculty/albert-kim |
Katherine M. Kinnaird | Clare Boothe Luce Assistant Professor of Computer Science and of Statistical & Data Sciences | /academics/faculty/katherine-kinnaird |
Scott LaCombe | Assistant Professor of Government and of Statistical & Data Sciences | /academics/faculty/scott-lacombe |
Lindsay Poirier | Assistant Professor of Statistics & Data Sciences | /academics/faculty/lindsay-poirier |
Nutcha Wattanachit | UMass Teaching Associate, Statistical and Data Sciences | /academics/faculty/faculty-nutcha-wattanachit |
Faith Zhang | Lecturer of Statistical and Data Sciences | /academics/faculty/faculty-faith-zhang |
We can now get data for individual web pages.
We also have a dataframe column with links to all the specific faculty pages.
We could iterate over those links to go to each of the pages and get more information.
This is where the danger is!
If we program a bot to go to more pages, it will do so as fast as possible unless we tell it otherwise.
Good bots take breaks so as to not overload the website. You can do that in R using Sys.sleep().
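A minimal sketch of polite iteration, assuming the faculty dataframe from before; the 10-second pause matches the crawl delay in Smith’s robots.txt:

```{r}
# visit each faculty page, pausing between requests
faculty_pages <- list()

for (link in faculty$rel_link) {
  url <- paste0("https://www.smith.edu", link)
  faculty_pages[[link]] <- read_html(url)
  Sys.sleep(10)  # be a good bot: wait 10 seconds between pages
}
```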
RegEx
SDS 270: Advanced Programming for Data Science