Day 15 - PDF Data Extraction

Spring 2023

Smith College

Overview

Timeline

  • The Portable Document Format (PDF)
  • Working with PDFs
  • PDF Data Extraction in R

Goal

To understand the PDF format and how to work around its limitations.

The Portable Document Format (PDF)

What are PDFs?

Portable Document Format (PDF) files are the de facto standard for sharing digital documents.


The standard was created to ensure the content of a file looked the same on every device it was opened on.


PDFs accomplish this by embedding the fonts, images, and anything else they need inside the file itself.

The PDF Format

However, the format was developed with no consideration toward data access.


PDFs work by storing arbitrary contents at X/Y coordinates on a page.


Often PDFs are scanned documents, in which each page is just a single image with no digital text at all.
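To make the coordinate idea concrete, here is a minimal sketch in base R. The `words` data frame is invented for illustration (it is not read from a real file); it loosely mimics what `pdftools::pdf_data()` returns, with one row per word and its position on the page. Nothing in it says which words belong to the same row, so that has to be inferred from the positions.

```r
# Hypothetical word positions: x = distance from the left edge,
# y = distance from the top edge (both in points).
# The rows are deliberately out of reading order, as a PDF is free to store them.
words <- data.frame(
  text = c("Fairfield", "19-3865", "19-3882", "Martinez"),
  x    = c(120, 40, 40, 120),
  y    = c(230, 200, 230, 200),
  stringsAsFactors = FALSE
)

# Recover reading order: top to bottom, then left to right.
ordered <- words[order(words$y, words$x), ]

# Rebuild each visual line by pasting together words that share a y position.
page_lines <- tapply(ordered$text, ordered$y, paste, collapse = " ")
print(unname(as.vector(page_lines)))
```

Real pages are messier: words on the same visual line rarely share exactly the same y value, so a tolerance (or rounding) is usually needed before grouping.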

Digital and Non-Digital PDFs

Digital

Flat

Optical Character Recognition

Optical Character Recognition (OCR) can be used to convert images of text into digital text.


The computer will try to make its best guess of what each character on the page is.


This is a pretty difficult task, and its accuracy depends on the clarity of the image.

PDFs Have ~0 Labeled Structure

HTML

PDF

🤷

Working with PDFs

Why?

You pretty much only work with PDFs if you have to.


But because of their ubiquity, you often have to.


PDFs are used for:

  • Government reports
  • Legal documents
  • Academic papers
  • Historical documents
  • Literature
  • Anything involving a corporation
  • Much, much more.

Workflow

To do any work with a PDF, you first need a version of it with digital contents.


After you have that, there are two broad strategies.


  • You can work with all of the contents in one big blob, and work to filter and clean it
  • You can use the X/Y coordinates to try to segment the document

flowchart TD
    A[Get PDF] --> B
    B[OCR if Needed]
    B --> C
    B --> D
    C[Text\nBlob]
    D[Page\nLocation]
    
    linkStyle 0 stroke:white
    linkStyle 1 stroke:white
    linkStyle 2 stroke:white
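The "OCR if Needed" step can be sketched as a simple check. This is a rough heuristic, not a standard test: it assumes you already have one string of extracted text per page (which is what `pdftools::pdf_text()` returns) and treats pages with almost no characters as scanned images. The function name `needs_ocr` and the 10-character threshold are made up for this example.

```r
# Flag pages whose extracted text is (nearly) empty and so probably need OCR.
# min_chars is an arbitrary assumption; adjust it for your documents.
needs_ocr <- function(page_text, min_chars = 10) {
  # Count only non-whitespace characters on each page.
  nchar(gsub("[[:space:]]", "", page_text)) < min_chars
}

# Illustrative input: page 1 has a text layer, page 2 came back empty.
pages <- c("SOLANO\n19-3865          Martinez       1           4", "")
print(needs_ocr(pages))
```

Pages flagged `TRUE` would go through OCR first; the rest can go straight to the text blob or page location strategies.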

PDF Data Extraction in R

Example PDF

I needed to convert 20 years' worth of government reports from ~600 page PDFs to tables for analysis.


There was no way I was going to do it by hand; my adviser said to pay a small army of undergrads.


I wrote code instead. It was still a massive undertaking, but it created the data set my dissertation is based on.

Approaches

Blob

pdftools::pdf_text(
  here::here("./week_6/pdf_data/2020_solano.pdf"))
[1] "                 State of California – Asset Forfeiture Report 2020\n\n                                                             Table 3\n\nAdmin Number /                                                         Amount           Date                           Amount\n                  City       Suspects   Offenses   Disposition                                       Recipient\nDocket Number                                                          Forfeited      Disbursed                       Disbursed\n\n\nSOLANO\n19-3865          Martinez       1           4      Plea Agreement         $ 941.31     07/20/2020\n\n                                                                                                    CDAA                   $ 9.41\n                                                                                                    CHP                 $ 611.85\n                                                                                                    DA OFFICE            $ 94.13\n                                                                                                    GENERAL FUND        $ 225.92\n\n19-3882          Fairfield      0           0      No Charge              $ 939.61     02/13/2020\n\n                                                                                                    CDAA                   $ 9.40\n                                                                                                    DA OFFICE            $ 93.96\n                                                                                                    FAIRFIELD PD        $ 610.75\n                                                                                                    GENERAL FUND        $ 225.50\n\n19-3883          Fairfield      0           0      No Charge             $ 2,492.36    02/13/2020\n\n                                                                                                    CDAA                 $ 
24.92\n                                                                                                    DA OFFICE           $ 249.24\n                                                                                                    FAIRFIELD PD       $ 1,620.03\n                                                                                                    GENERAL FUND         $ 598.17\n\n19-3884          Fairfield      1           3      Dropped               $ 1,009.34    02/13/2020\n                                                   Charges\n                                                                                                    CDAA                 $ 10.09\n                                                                                                    DA OFFICE           $ 100.94\n                                                                                                    FAIRFIELD PD        $ 656.07\n                                                                                                    GENERAL FUND        $ 242.24\n\n19-3887          Rio Vista      1           5      Other                 $ 4,277.43    07/30/2020\n\n                                                                                                    CDAA                  $ 42.78\n                                                                                                    DA OFFICE            $ 427.74\n                                                                                                    GENERAL FUND       $ 1,026.58\n                                                                                                    VACAVILLE PD       $ 2,780.33\n\n19-3889          Vacaville      1           4      Other                 $ 4,864.29    07/30/2020\n\n                                                                                                    CDAA                 $ 48.64\n                                                            
                                        DA OFFICE            $ 486.43\n                                                                                                    GENERAL FUND       $ 1,167.43\n                                                                                                    VACAVILLE PD       $ 3,161.79\n\nSOLANO                                                                                                             $ 14,524.34\n\n\n\n\n                                                                                                                                    249\n"

Location

tabulizer::extract_tables(
  here::here("./week_6/pdf_data/2020_solano.pdf"))
[[1]]
     [,1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
[1,] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
[2,] "Admin Number /\rDocket NumberCitySuspectsOffensesDispositionAmountDateAmount\rForfeitedDisbursedRecipient\rDisbursed\rSOLANO\r19-3865Martinez14Plea Agreement$ 941.3107/20/2020\rCDAA$ 9.41\rCHP$ 611.85\rDA OFFICE$ 94.13\rGENERAL FUND$ 225.92\r19-3882Fairfield00No Charge$ 939.6102/13/2020\rCDAA$ 9.40\rDA OFFICE$ 93.96\rFAIRFIELD PD$ 610.75\rGENERAL FUND$ 225.50\r19-3883Fairfield00No Charge$ 2,492.3602/13/2020\rCDAA$ 24.92\rDA OFFICE$ 249.24\rFAIRFIELD PD$ 1,620.03\rGENERAL FUND$ 598.17\r19-3884Fairfield13Dropped$ 1,009.3402/13/2020\rCharges\rCDAA$ 10.09\rDA OFFICE$ 100.94\rFAIRFIELD PD$ 656.07\rGENERAL FUND$ 242.24\r19-3887Rio Vista15Other$ 4,277.4307/30/2020\rCDAA$ 42.78\rDA OFFICE$ 427.74\rGENERAL FUND$ 1,026.58\rVACAVILLE PD$ 2,780.33\r19-3889Vacaville14Other$ 4,864.2907/30/2020\rCDAA$ 48.64\rDA OFFICE$ 486.43\rGENERAL FUND$ 1,167.43\rVACAVILLE PD$ 3,161.79\rSOLANO\r$ 14,524.34"
     [,2]                            [,3]   [,4]       [,5] [,6]      
[1,] ""                              ""     ""         ""   ""        
[2,] "Admin Number /\rDocket Number" "City" "Suspects" ""   "Offenses"
     [,7]          [,8] [,9]                [,10] [,11]             [,12]
[1,] ""            ""   ""                  ""    ""                ""   
[2,] "Disposition" ""   "Amount\rForfeited" ""    "Date\rDisbursed" ""   
     [,13]       [,14]              
[1,] ""          ""                 
[2,] "Recipient" "Amount\rDisbursed"

Areas

tabulizer::extract_tables(
  here::here("./week_6/pdf_data/2020_solano.pdf"),
  area = list(c("top" = 171.95982, "left" = 76.13792,
                "bottom" = 520.42466, "right" = 425.10342)),
  guess = FALSE)
[[1]]
     [,1]      [,2]        [,3] [,4] [,5]             [,6]         [,7]        
[1,] "19-3865" "Martinez"  "1"  "4"  "Plea Agreement" "$ 941.31"   "07/20/2020"
[2,] "19-3882" "Fairfield" "0"  "0"  "No Charge"      "$ 939.61"   "02/13/2020"
[3,] "19-3883" "Fairfield" "0"  "0"  "No Charge"      "$ 2,492.36" "02/13/2020"
[4,] "19-3884" "Fairfield" "1"  "3"  "Dropped"        "$ 1,009.34" "02/13/2020"
[5,] ""        ""          ""   ""   "Charges"        ""           ""          
[6,] "19-3887" "Rio Vista" "1"  "5"  "Other"          "$ 4,277.43" "07/30/2020"
[7,] "19-3889" "Vacaville" "1"  "4"  "Other"          "$ 4,864.29" "07/30/2020"
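The extracted matrix still needs cleanup: the disposition "Dropped Charges" wrapped onto its own row, and every value is text. The sketch below is one possible way to handle that in base R, not the only one. The matrix is retyped by hand from the output above so the example runs on its own, and the helper `merge_continuations` and the column names are invented for illustration.

```r
# The matrix printed above, typed in by hand so the example is self-contained.
tab <- matrix(c(
  "19-3865", "Martinez",  "1", "4", "Plea Agreement", "$ 941.31",   "07/20/2020",
  "19-3882", "Fairfield", "0", "0", "No Charge",      "$ 939.61",   "02/13/2020",
  "19-3883", "Fairfield", "0", "0", "No Charge",      "$ 2,492.36", "02/13/2020",
  "19-3884", "Fairfield", "1", "3", "Dropped",        "$ 1,009.34", "02/13/2020",
  "",        "",          "",  "",  "Charges",        "",           "",
  "19-3887", "Rio Vista", "1", "5", "Other",          "$ 4,277.43", "07/30/2020",
  "19-3889", "Vacaville", "1", "4", "Other",          "$ 4,864.29", "07/30/2020"
), ncol = 7, byrow = TRUE)

# Fold rows with an empty first column into the nearest row above them.
# Assumes the first row of the matrix is never a continuation row.
merge_continuations <- function(m) {
  has_id <- m[, 1] != ""
  for (i in which(!has_id)) {
    parent <- max(which(has_id[seq_len(i - 1)]))
    extra  <- m[i, ] != ""
    m[parent, extra] <- paste(m[parent, extra], m[i, extra])
  }
  m[has_id, , drop = FALSE]
}

cases <- as.data.frame(merge_continuations(tab), stringsAsFactors = FALSE)
names(cases) <- c("docket", "city", "suspects", "offenses",
                  "disposition", "amount_forfeited", "date_disbursed")

# Convert the text columns to proper types.
cases$suspects         <- as.integer(cases$suspects)
cases$offenses         <- as.integer(cases$offenses)
cases$amount_forfeited <- as.numeric(gsub("[$, ]", "", cases$amount_forfeited))
cases$date_disbursed   <- as.Date(cases$date_disbursed, format = "%m/%d/%Y")
str(cases)
```

A quick sanity check: the six forfeited amounts sum to 14524.34, which matches the SOLANO total printed at the bottom of the page in the `pdf_text()` output.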

Tools - OCR

Almost all OCR in R is done using the tesseract package.


Other packages may provide helper functions to make using it easier.


The package is built on Google's Tesseract engine, which is now open source.

tesseract::ocr(
  here::here(
    "./week_6/pdf_data/img/testocr.png"))
[1] "This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n"
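Because OCR output is a best guess, it helps to know which words the engine was unsure about. The `tesseract` package has an `ocr_data()` function that returns one row per word with a confidence score. The sketch below does not call it; it uses a small made-up data frame in roughly that shape so it runs without an image. The words, scores, helper name `flag_low_confidence`, and the cutoff of 80 are all illustrative assumptions.

```r
# Made-up example in the general shape of tesseract::ocr_data() output:
# one row per word with a confidence score from 0 to 100.
ocr_words <- data.frame(
  word       = c("The", "quick", "brovvn", "dog"),
  confidence = c(96.4, 95.1, 41.7, 93.8),
  stringsAsFactors = FALSE
)

# Return the words that fall below a confidence cutoff for manual review.
# The default cutoff is arbitrary; tune it for your own documents.
flag_low_confidence <- function(words, cutoff = 80) {
  words[words$confidence < cutoff, , drop = FALSE]
}

print(flag_low_confidence(ocr_words))
```

On a large document, reviewing only the flagged words is far cheaper than proofreading every page.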

Tools - Blobs & Metadata

The pdftools package provides a number of functions for working with PDFs.


You can get some metadata using the pdf_info() function.


You can get the actual contents of the PDF using pdf_text(), but it will all be one giant blob.

pdf_path <- here::here("./week_6/pdf_data/2020_solano.pdf")
pdftools::pdf_info(pdf_path)$keys[1:3]
$Author
[1] "CA DOJ"

$Creator
[1] "Adobe InDesign 16.1 (Macintosh)"

$Keywords
[1] "2020, Asset, Forfeiture, Report,"

substr(pdftools::pdf_text(pdf_path), 1, 1000)
[1] "                 State of California – Asset Forfeiture Report 2020\n\n                                                             Table 3\n\nAdmin Number /                                                         Amount           Date                           Amount\n                  City       Suspects   Offenses   Disposition                                       Recipient\nDocket Number                                                          Forfeited      Disbursed                       Disbursed\n\n\nSOLANO\n19-3865          Martinez       1           4      Plea Agreement         $ 941.31     07/20/2020\n\n                                                                                                    CDAA                   $ 9.41\n                                                                                                    CHP                 $ 611.85\n                                                                                                    DA OFFICE            $ 94.13\n"
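Working with the blob mostly means splitting it into lines and using patterns to pull out the parts you want. The sketch below shows one way to do that in base R. The `blob` string is a shortened excerpt of the `pdf_text()` output (some of the padding whitespace was trimmed by hand), and the rule that columns are separated by two or more spaces is an assumption that happens to hold for this report.

```r
# A short excerpt of the pdf_text() output, with some padding trimmed.
blob <- paste(
  "SOLANO",
  "19-3865          Martinez       1           4      Plea Agreement         $ 941.31     07/20/2020",
  "",
  "                                                    CDAA                   $ 9.41",
  "                                                    CHP                 $ 611.85",
  "19-3882          Fairfield      0           0      No Charge              $ 939.61     02/13/2020",
  sep = "\n"
)

# Split the blob into lines and keep those that start with a docket number.
blob_lines <- strsplit(blob, "\n", fixed = TRUE)[[1]]
case_lines <- grep("^[0-9]{2}-[0-9]{4}", blob_lines, value = TRUE)

# Treat runs of two or more spaces as column separators, so values such as
# "Plea Agreement" (single space inside) stay together.
fields <- strsplit(trimws(case_lines), " {2,}")
cases  <- do.call(rbind, fields)
print(cases)
```

This only works when every case line splits into the same number of fields; a missing value in the middle of a line would shift the columns, which is where the coordinate-based approach becomes useful.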

Tools - Areas

The tabulizer package lets you interactively extract text from specific areas on a page in a PDF.


It also has tools to automatically extract data from (well-formatted) tables.


Even if you aren’t after tables, being able to get text from specific areas is helpful.

tabulizer::extract_tables(pdf_path,
  guess = FALSE,
  area = list(c(170, 76, 514, 424)))[[1]][, 1:5]
     [,1]      [,2]        [,3] [,4] [,5]            
[1,] "19-3865" "Martinez"  "1"  "4"  "Plea Agreement"
[2,] "19-3882" "Fairfield" "0"  "0"  "No Charge"     
[3,] "19-3883" "Fairfield" "0"  "0"  "No Charge"     
[4,] "19-3884" "Fairfield" "1"  "3"  "Dropped"       
[5,] ""        ""          ""   ""   "Charges"       
[6,] "19-3887" "Rio Vista" "1"  "5"  "Other"         
[7,] "19-3889" "Vacaville" "1"  "4"  "Other"         
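Conceptually, extracting by area is a spatial filter: keep whatever falls inside a rectangle given as top, left, bottom, right. The sketch below illustrates that idea in base R. It is not how `tabulizer` works internally, and the word positions are hypothetical values chosen for the example; only the area numbers come from the call above.

```r
# Hypothetical word positions (points from the top-left corner of the page).
words <- data.frame(
  text = c("Table", "3", "19-3865", "Martinez", "249"),
  x    = c(300, 335, 80, 160, 560),
  y    = c(60, 60, 180, 180, 760),
  stringsAsFactors = FALSE
)

# Same ordering as the area vector above: top, left, bottom, right.
area <- c(top = 170, left = 76, bottom = 514, right = 424)

# Keep only the words whose position falls inside the rectangle.
inside <- words$y >= area[["top"]]  & words$y <= area[["bottom"]] &
          words$x >= area[["left"]] & words$x <= area[["right"]]
print(words$text[inside])
```

The page title and the page number fall outside the rectangle and are dropped, which is why choosing the area carefully removes much of the cleanup work.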

Code-Demo

For Next Time

Topic

Lab 5, Quiz 2 Open

To-Do

  • Finish Worksheet