Spring 2023
Smith College
Goal: understand the PDF format and how to work around its limitations.
Portable Document Format (PDF) files are the de facto standard for sharing digital documents.
The standard was created to ensure the content of a file looks the same on every device that opens it.
PDFs accomplish this by embedding the fonts, images, and anything else they need inside the file itself.
However, the format was developed with no consideration toward data access.
PDFs work by storing arbitrary contents at X/Y coordinates on a page.
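To make the X/Y idea concrete, here is a minimal sketch in base R. The `words` data frame is made-up illustration data, not output from any package, and it assumes y is measured downward from the top of the page. It shows that reading order is not stored anywhere: it has to be rebuilt by sorting on position.

```r
# Hypothetical word positions, listed in the arbitrary order a PDF might store them.
words <- data.frame(
  text = c("Fairfield", "19-3865", "19-3882", "Martinez"),
  x    = c(120, 40, 40, 120),   # horizontal position on the page
  y    = c(230, 200, 230, 200), # vertical position (assumed to grow downward)
  stringsAsFactors = FALSE
)

# Rebuild reading order: top to bottom, then left to right.
ordered <- words[order(words$y, words$x), ]

# Paste together the words that share a y value to recover each line.
lines <- tapply(ordered$text, ordered$y, paste, collapse = " ")
print(as.vector(lines))
```

Real pages are messier than this: words on the same visual line rarely share an identical y value, so some tolerance for grouping is usually needed.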
Often PDFs are scanned documents, in which case each page is just an image with no digital text behind it.
Optical Character Recognition (OCR) can be used to convert images of text into digital text.
The computer makes its best guess at what each character on the page is.
This is a pretty difficult task, and its accuracy depends on the clarity of the image.
🤷
You pretty much only work with PDFs if you have to.
But because of their ubiquity, you often have to.
PDFs are used for:
To do any work with a PDF, you first need a version of it that contains digital text.
After you have that, there are two broad strategies.
flowchart TD
  A[Get PDF] --> B
  B[OCR if Needed]
  B --> C
  B --> D
  C[Text\nBlob]
  D[Page\nLocation]
  linkStyle 0 stroke:white
  linkStyle 1 stroke:white
  linkStyle 2 stroke:white
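One rough way to handle the "OCR if Needed" step is to check whether a page yields any text at all. The sketch below uses base R only; `needs_ocr()` and its `min_chars` threshold are hypothetical names chosen for illustration, and `pages` stands in for a character vector with one string per page.

```r
# Flag pages whose extracted text is (almost) empty once whitespace is removed.
# min_chars is an arbitrary cutoff picked for this example.
needs_ocr <- function(page_text, min_chars = 20) {
  nchar(gsub("\\s", "", page_text)) < min_chars
}

# Stand-in for extracted page text: two blank (scanned) pages and one digital page.
pages <- c(
  "",
  "  \n ",
  "SOLANO\n19-3865 Martinez 1 4 Plea Agreement $ 941.31 07/20/2020\n"
)

print(needs_ocr(pages))
```

Pages flagged `TRUE` would go through OCR first; the rest can go straight to one of the two strategies.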
I needed to convert 20 years' worth of government reports from ~600 page PDFs to tables for analysis.
There was no way I was going to do it by hand; my adviser said to pay a small army of undergrads.
I wrote code instead. It was still a massive undertaking, but it created the data set my dissertation is based on.
[1] " State of California – Asset Forfeiture Report 2020\n\n Table 3\n\nAdmin Number / Amount Date Amount\n City Suspects Offenses Disposition Recipient\nDocket Number Forfeited Disbursed Disbursed\n\n\nSOLANO\n19-3865 Martinez 1 4 Plea Agreement $ 941.31 07/20/2020\n\n CDAA $ 9.41\n CHP $ 611.85\n DA OFFICE $ 94.13\n GENERAL FUND $ 225.92\n\n19-3882 Fairfield 0 0 No Charge $ 939.61 02/13/2020\n\n CDAA $ 9.40\n DA OFFICE $ 93.96\n FAIRFIELD PD $ 610.75\n GENERAL FUND $ 225.50\n\n19-3883 Fairfield 0 0 No Charge $ 2,492.36 02/13/2020\n\n CDAA $ 24.92\n DA OFFICE $ 249.24\n FAIRFIELD PD $ 1,620.03\n GENERAL FUND $ 598.17\n\n19-3884 Fairfield 1 3 Dropped $ 1,009.34 02/13/2020\n Charges\n CDAA $ 10.09\n DA OFFICE $ 100.94\n FAIRFIELD PD $ 656.07\n GENERAL FUND $ 242.24\n\n19-3887 Rio Vista 1 5 Other $ 4,277.43 07/30/2020\n\n CDAA $ 42.78\n DA OFFICE $ 427.74\n GENERAL FUND $ 1,026.58\n VACAVILLE PD $ 2,780.33\n\n19-3889 Vacaville 1 4 Other $ 4,864.29 07/30/2020\n\n CDAA $ 48.64\n DA OFFICE $ 486.43\n GENERAL FUND $ 1,167.43\n VACAVILLE PD $ 3,161.79\n\nSOLANO $ 14,524.34\n\n\n\n\n 249\n"
[[1]]
[,1]
[1,] ""
[2,] "Admin Number /\rDocket NumberCitySuspectsOffensesDispositionAmountDateAmount\rForfeitedDisbursedRecipient\rDisbursed\rSOLANO\r19-3865Martinez14Plea Agreement$ 941.3107/20/2020\rCDAA$ 9.41\rCHP$ 611.85\rDA OFFICE$ 94.13\rGENERAL FUND$ 225.92\r19-3882Fairfield00No Charge$ 939.6102/13/2020\rCDAA$ 9.40\rDA OFFICE$ 93.96\rFAIRFIELD PD$ 610.75\rGENERAL FUND$ 225.50\r19-3883Fairfield00No Charge$ 2,492.3602/13/2020\rCDAA$ 24.92\rDA OFFICE$ 249.24\rFAIRFIELD PD$ 1,620.03\rGENERAL FUND$ 598.17\r19-3884Fairfield13Dropped$ 1,009.3402/13/2020\rCharges\rCDAA$ 10.09\rDA OFFICE$ 100.94\rFAIRFIELD PD$ 656.07\rGENERAL FUND$ 242.24\r19-3887Rio Vista15Other$ 4,277.4307/30/2020\rCDAA$ 42.78\rDA OFFICE$ 427.74\rGENERAL FUND$ 1,026.58\rVACAVILLE PD$ 2,780.33\r19-3889Vacaville14Other$ 4,864.2907/30/2020\rCDAA$ 48.64\rDA OFFICE$ 486.43\rGENERAL FUND$ 1,167.43\rVACAVILLE PD$ 3,161.79\rSOLANO\r$ 14,524.34"
[,2] [,3] [,4] [,5] [,6]
[1,] "" "" "" "" ""
[2,] "Admin Number /\rDocket Number" "City" "Suspects" "" "Offenses"
[,7] [,8] [,9] [,10] [,11] [,12]
[1,] "" "" "" "" "" ""
[2,] "Disposition" "" "Amount\rForfeited" "" "Date\rDisbursed" ""
[,13] [,14]
[1,] "" ""
[2,] "Recipient" "Amount\rDisbursed"
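# Extract only the rectangle of the page that holds the case rows.
tabulizer::extract_tables(
  here::here("./week_6/pdf_data/2020_solano.pdf"),
  area = list(c("top" = 171.95982, "left" = 76.13792,
                "bottom" = 520.42466, "right" = 425.10342)),
  guess = FALSE)  # use the supplied area rather than guessing where tables are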
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "19-3865" "Martinez" "1" "4" "Plea Agreement" "$ 941.31" "07/20/2020"
[2,] "19-3882" "Fairfield" "0" "0" "No Charge" "$ 939.61" "02/13/2020"
[3,] "19-3883" "Fairfield" "0" "0" "No Charge" "$ 2,492.36" "02/13/2020"
[4,] "19-3884" "Fairfield" "1" "3" "Dropped" "$ 1,009.34" "02/13/2020"
[5,] "" "" "" "" "Charges" "" ""
[6,] "19-3887" "Rio Vista" "1" "5" "Other" "$ 4,277.43" "07/30/2020"
[7,] "19-3889" "Vacaville" "1" "4" "Other" "$ 4,864.29" "07/30/2020"
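The wrapped "Charges" row in the output above is a typical cleanup problem with area-based extraction. Below is a minimal base R sketch of one way to fold such rows back into the row above. `merge_continuations()` is a hypothetical helper, and it assumes that a row with an empty first column is always a continuation of the previous case, which may not hold for every report.

```r
# A few rows copied from the extracted matrix, including the wrapped cell.
tbl <- rbind(
  c("19-3883", "Fairfield", "0", "0", "No Charge"),
  c("19-3884", "Fairfield", "1", "3", "Dropped"),
  c("",        "",          "",  "",  "Charges"),
  c("19-3887", "Rio Vista", "1", "5", "Other")
)

# Append the non-empty cells of a continuation row to the last "real" row.
merge_continuations <- function(m) {
  keep <- rep(TRUE, nrow(m))
  last <- NA_integer_
  for (i in seq_len(nrow(m))) {
    if (nzchar(m[i, 1])) {
      last <- i                      # row starts a new case
    } else if (!is.na(last)) {
      extra <- nzchar(m[i, ])        # cells that carry wrapped text
      m[last, extra] <- trimws(paste(m[last, extra], m[i, extra]))
      keep[i] <- FALSE               # drop the continuation row
    }
  }
  m[keep, , drop = FALSE]
}

clean <- merge_continuations(tbl)
print(clean)
```

After this step each case occupies exactly one row, so the matrix can be converted to a data frame with one observation per docket number.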
Almost all OCR in R is done using the tesseract package.
Other packages may provide helper functions that make it easier to use.
It is based on the now open-source Google Tesseract engine.
[1] "This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n"
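OCR output like the string above keeps the line breaks of the original image, which is rarely what you want for analysis. As a small sketch in base R (the `ocr_text` value is a shortened copy of that output, typed in by hand rather than produced by running OCR), the text can be reflowed into one string per paragraph:

```r
# Shortened copy of the OCR result shown above.
ocr_text <- paste0(
  "This is a lot of 12 point text to test the\n",
  "ocr code and see if it works on all types\n",
  "of file format.\n\n",
  "The quick brown dog jumped over the\n",
  "lazy fox.\n"
)

# Blank lines separate paragraphs; single newlines are just visual wrapping.
paragraphs <- strsplit(ocr_text, "\n{2,}", perl = TRUE)[[1]]
paragraphs <- trimws(gsub("\\s*\n\\s*", " ", paragraphs))
paragraphs <- paragraphs[nzchar(paragraphs)]

print(paragraphs)
```

This only repairs layout. Misread characters still have to be caught some other way, for example by checking values against expected patterns.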
The pdftools package provides a number of functions for working with PDFs.
You can get some metadata using the pdf_info() function.
You can get the actual contents of the PDF using pdf_text(), but it will all be one giant blob.
$Author
[1] "CA DOJ"
$Creator
[1] "Adobe InDesign 16.1 (Macintosh)"
$Keywords
[1] "2020, Asset, Forfeiture, Report,"
[1] " State of California – Asset Forfeiture Report 2020\n\n Table 3\n\nAdmin Number / Amount Date Amount\n City Suspects Offenses Disposition Recipient\nDocket Number Forfeited Disbursed Disbursed\n\n\nSOLANO\n19-3865 Martinez 1 4 Plea Agreement $ 941.31 07/20/2020\n\n CDAA $ 9.41\n CHP $ 611.85\n DA OFFICE $ 94.13\n"
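Working with the text blob usually means splitting it into lines and pulling fields out with regular expressions. The following is a minimal base R sketch under some assumptions: `blob` is a hand-typed excerpt modeled on the output above (the column spacing is invented), and the pattern assumes every case row starts with a docket number such as `19-3865` and ends with a date.

```r
# Hand-typed excerpt in the style of the page text; spacing is illustrative.
blob <- paste(
  "SOLANO",
  "19-3865   Martinez    1   4   Plea Agreement   $ 941.31    07/20/2020",
  "",
  "                CDAA   $ 9.41",
  "19-3887   Rio Vista   1   5   Other            $ 4,277.43  07/30/2020",
  sep = "\n"
)

# One element per line, then keep only the lines that begin with a docket number.
lines <- trimws(strsplit(blob, "\n", fixed = TRUE)[[1]])
case_lines <- grep("^\\d{2}-\\d{4}\\s", lines, value = TRUE, perl = TRUE)

# docket, city, suspects, offenses, disposition, amount, date
pattern <- paste0(
  "^(\\d{2}-\\d{4})\\s+(.+?)\\s+(\\d+)\\s+(\\d+)\\s+(.+?)",
  "\\s+\\$\\s*([\\d,]+\\.\\d{2})\\s+(\\d{2}/\\d{2}/\\d{4})$"
)
parts <- regmatches(case_lines, regexec(pattern, case_lines, perl = TRUE))
fields <- do.call(rbind, lapply(parts, function(p) p[-1]))  # drop the full match

cases <- data.frame(
  docket           = fields[, 1],
  city             = fields[, 2],
  suspects         = as.integer(fields[, 3]),
  offenses         = as.integer(fields[, 4]),
  disposition      = fields[, 5],
  amount_forfeited = as.numeric(gsub(",", "", fields[, 6])),
  date_disbursed   = as.Date(fields[, 7], format = "%m/%d/%Y"),
  stringsAsFactors = FALSE
)
print(cases)
```

The recipient lines (CDAA, CHP, and so on) are skipped here; they would need a second pattern and a way to attach them to the case row above them.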
The tabulizer package lets you interactively extract text from specific areas on a page in a PDF.
It also has tools to automatically extract data from (well-formatted) tables.
Even if you aren’t after tables, being able to get text from specific areas is helpful.
[,1] [,2] [,3] [,4] [,5]
[1,] "19-3865" "Martinez" "1" "4" "Plea Agreement"
[2,] "19-3882" "Fairfield" "0" "0" "No Charge"
[3,] "19-3883" "Fairfield" "0" "0" "No Charge"
[4,] "19-3884" "Fairfield" "1" "3" "Dropped"
[5,] "" "" "" "" "Charges"
[6,] "19-3887" "Rio Vista" "1" "5" "Other"
[7,] "19-3889" "Vacaville" "1" "4" "Other"
Lab 5, Quiz 2 Open
SDS 270: Advanced Programming for Data Science