# <REPLACE THIS COMMENT WITH YOUR ANSWER>
Lab 5. PDFs and Parallel
Introduction
Access the lab on GitHub Classroom: GitHub Classroom Assignment for Lab 5: PDFs & Parallel
In today's lab, we will extract data from the Massachusetts COVID-19 vaccination reports. MA does publish its vaccination data as a CSV, but that file only shows the current running total. For this project we want to know how many vaccines were given out on each specific day, and that data is only available in the PDF reports. To complicate things further, the reports are published weekly, so we will need to extract information from several PDFs to build a useful dataset.
I have included a sample of the reports published from September 2022 through February 2023 in the data directory of this lab. Be sure to look them over before you start coding. We want to take the data shown in the figures on page 3 of each report and combine it across all PDFs into a single dataframe. Here is an example of the final output (we can ignore years in the output):
date | vacs | file |
---|---|---|
Dec 01 | 13743 | ./data/weekly-covid-19-vaccination-report-1-4-2023.pdf |
Dec 11 | 4988 | ./data/weekly-covid-19-vaccination-report-1-11-2023.pdf |
Dec 21 | 9765 | ./data/weekly-covid-19-vaccination-report-1-11-2023.pdf |
Dec 31 | 3279 | ./data/weekly-covid-19-vaccination-report-1-11-2023.pdf |
Jan 05 | 7549 | ./data/Weekly-COVID-19-Vaccination-report-1-18-23_0.pdf |
Jan 15 | 1602 | ./data/weekly-covid-19-vaccination-report-1-25-2023.pdf |
Jan 25 | 3655 | ./data/weekly-covid-19-vaccination-report-2-15-2023.pdf |
The lab is only one question long, but that does not mean you should jump straight into writing code that iterates over all the reports. I encourage you to follow the scaffolding I made explicit in previous assignments and develop your code in the following steps:
- Write code to extract the vaccine numbers per day from one PDF
- Turn that code into a function
- Iterate that function over several PDFs, combine outputs, and clean
- (CHALLENGE) Parallelize your code
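
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the lab's reference solution, and it rests on several assumptions. It assumes the text pulled from page 3 contains a month abbreviation, a day, and a count on each line (for example "Dec 01 13,743"), so check what your extraction actually returns first. It assumes the third-party pypdf package is used to read the page. It assumes "clean" means dropping dates that show up in more than one report. All function names (`read_page_text`, `parse_daily_counts`, `extract_report`, `extract_all`) are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# ASSUMPTION: the extracted page text looks like "Dec 01 13,743" per day.
# Inspect the real text from a report and adjust this pattern to match it.
DAY_COUNT = re.compile(
    r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2})\s+(\d[\d,]*)"
)


def read_page_text(pdf_path, page_index=2):
    """Return the text of page 3 (index 2) of one PDF.

    ASSUMPTION: uses the third-party pypdf package; it is imported here so
    the rest of the sketch runs without it.
    """
    from pypdf import PdfReader

    return PdfReader(pdf_path).pages[page_index].extract_text() or ""


def parse_daily_counts(page_text, file):
    """Step 1: turn the text of one page into rows of date / vacs / file."""
    rows = []
    for month, day, count in DAY_COUNT.findall(page_text):
        rows.append(
            {
                "date": f"{month} {int(day):02d}",
                "vacs": int(count.replace(",", "")),
                "file": file,
            }
        )
    return rows


def extract_report(pdf_path, reader=read_page_text):
    """Step 2: the single-PDF code wrapped up as a function."""
    return parse_daily_counts(reader(pdf_path), str(pdf_path))


def extract_all(pdf_paths, reader=read_page_text, workers=4):
    """Steps 3 and 4: map the function over many PDFs with a thread pool,
    combine the per-file rows, and drop repeated dates."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as pdf_paths.
        per_file = list(pool.map(partial(extract_report, reader=reader), pdf_paths))

    combined, seen = [], set()
    for file_rows in per_file:
        for row in file_rows:
            # ASSUMPTION: when reports overlap, keep the first row seen for a date.
            if row["date"] not in seen:
                seen.add(row["date"])
                combined.append(row)
    return combined
```

In the lab this might be called as `extract_all(sorted(glob.glob("./data/*.pdf")))` (after `import glob`), and the resulting list of dictionaries can be passed to a dataframe constructor. A thread pool is used here because it is the simplest thing to run anywhere. PDF parsing is mostly CPU-bound, so swapping in `ProcessPoolExecutor` from the same module, called from under an `if __name__ == "__main__":` guard, is likely to give a larger speedup.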