Lab 5. PDFs and Parallel

Author

Dr. Jared Joseph

Introduction

Access the lab on GitHub Classroom: GitHub Classroom Assignment for Lab 5: PDFs & Parallel

For lab today, we will be getting data out of the Massachusetts COVID-19 vaccination reports. While there is a CSV database of vaccination data for MA, it only shows the current running total. What we want to know for this project is how many vaccines were given out on each specific day. Unfortunately for us, this data is only available through the PDF reports. Further complicating things is the fact that the vaccine reports are only published weekly, meaning we will need to extract information from several PDFs to get any sort of useful database.

I have included a sampling of the reports published between September 2022 and February 2023 in the data directory of this lab. Be sure to look them over before starting to code. We want to take the data shown in the figures on page 3 of each report and combine it, across all PDFs, into a single dataframe. Here is an example of the final output (we can ignore years in the output):

date    vacs   file
Dec 01  13743  ./data/weekly-covid-19-vaccination-report-1-4-2023.pdf
Dec 11  4988   ./data/weekly-covid-19-vaccination-report-1-11-2023.pdf
Dec 21  9765   ./data/weekly-covid-19-vaccination-report-1-11-2023.pdf
Dec 31  3279   ./data/weekly-covid-19-vaccination-report-1-11-2023.pdf
Jan 05  7549   ./data/Weekly-COVID-19-Vaccination-report-1-18-23_0.pdf
Jan 15  1602   ./data/weekly-covid-19-vaccination-report-1-25-2023.pdf
Jan 25  3655   ./data/weekly-covid-19-vaccination-report-2-15-2023.pdf
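
Getting to a table like the one above usually takes some cleaning after the raw extraction. The following is a minimal sketch of two cleaning steps you might need, not the required approach. It assumes (and you should check this against the PDFs) that counts arrive as text with thousands separators and that consecutive reports repeat some of the same dates. The rows in `raw` are made up for illustration.

```r
# Made-up rows standing in for combined, uncleaned output from two reports
raw = data.frame(
  date = c("Jan 01", "Jan 02", "Jan 02"),
  vacs = c("1,000", "2,500", "2,500"),
  file = c("./data/report-a.pdf", "./data/report-a.pdf", "./data/report-b.pdf")
)

clean = raw

# Assumption: counts are character strings with commas, so strip and convert
clean$vacs = as.numeric(gsub(",", "", clean$vacs))

# Assumption: overlapping reports repeat dates, so keep the first row per date
clean = clean[!duplicated(clean$date), ]

clean
```

Dropping duplicates on `date` alone only works because we are ignoring years and the sample covers less than twelve months.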

The lab is only 1 question long, but that does not mean you should jump right into writing code to iterate over all the reports. I encourage you to follow the scaffolding I made explicit in previous assignments and develop your code in the following steps:

  1. Write code to extract the vaccine numbers per day from one PDF
  2. Turn that code into a function
  3. Iterate that function over several PDFs, combine outputs, and clean
  4. (CHALLENGE) Parallelize your code
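
Steps 2 through 4 above can be sketched as follows. This is only an outline of the pattern, under stated assumptions: `extract_vacs()` is a hypothetical stand-in that returns fake rows so the code runs without any PDFs, and the file paths are invented. Your real function would do the PDF extraction from step 1.

```r
library(parallel)

# Hypothetical stand-in for your per-PDF function (steps 1-2).
# It returns fake rows; a real version would read page 3 of `file`.
extract_vacs = function(file) {
  data.frame(date = c("Jan 01", "Jan 02"), vacs = c(10, 20), file = file)
}

# Invented paths for illustration. In the lab you might instead use:
# files = list.files("./data", pattern = "\\.pdf$", full.names = TRUE)
files = c("./data/report-a.pdf", "./data/report-b.pdf")

# Step 3: apply the function to every file, then stack the results
results = lapply(files, extract_vacs)
vac_data = do.call(rbind, results)

# Step 4: the same pattern, spread across two worker processes
cl = makeCluster(2)
par_results = parLapply(cl, files, extract_vacs)
stopCluster(cl)
par_data = do.call(rbind, par_results)
```

Because `parLapply()` runs in fresh R sessions, any package your real function depends on has to be loaded on the workers too, for example with `clusterEvalQ()` before the `parLapply()` call.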

Question 1

Write code to extract the desired data from all of the PDF vaccination reports provided in this lab. Your output should look like the example provided above.

Tip

If using tabulizer for the task, rather than using extract_areas() I strongly recommend using locate_areas() and then manually assigning those coordinates to extract_tables() using the area argument. That way, you do not need to interactively draw the box around your areas for every document! It would look like this:

# Get the coordinates interactively once, then add them to code
locate_areas(XXXX) # comment me out once you have the coordinates
chart_coords = list(c("top" = 20, "left" = 100, "bottom" = 80, "right" = 200))

output = extract_tables(XXXX, area = chart_coords, XXXX)
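
Once you have `output`, you still need to turn it into a dataframe. As far as I know, tabulizer's `extract_tables()` returns a list with one element per extracted table, and by default each element is a character matrix; treat that as an assumption and inspect your own result with `str(output)`. The sketch below uses a hand-built stand-in for `output` with made-up values and a made-up file path, so it runs without tabulizer.

```r
# Stand-in for extract_tables() output: a list holding one character matrix.
# The values are made up; matrix() fills column by column.
output = list(
  matrix(c("Jan 01", "Jan 02", "1,000", "2,500"), ncol = 2)
)

# Take the first (only) table and convert it to a dataframe
one_report = as.data.frame(output[[1]], stringsAsFactors = FALSE)
names(one_report) = c("date", "vacs")

# Record which PDF the rows came from (hypothetical path)
one_report$file = "./data/example-report.pdf"

one_report
```

Text pulled from a figure may not come out as tidy as this, so expect to adjust the column handling for the real reports.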
# <REPLACE THIS COMMENT WITH YOUR ANSWER>