I recently published a story based on some data analysis I did of a report I obtained from the Department of Behavioral Health and Developmental Services in Virginia. I wanted to share a quick walkthrough of how I extracted the data from tables in a PDF using a Python library called pdfplumber. I also uploaded a video to YouTube if you prefer that format.
Using pdfplumber, I was able to create the graph that shows the trend at the center of my article. I hope some of you can take something away from this walkthrough to help supplement your own reporting, especially if you're interested in data journalism.
In order to use pdfplumber, you need Python installed, along with an editor or IDE. A lot of data analysts use Jupyter notebooks, but I use VS Code. If you don't know any Python, it's not too hard to pick up, and I'd recommend the Socratica videos on YouTube.
You can use pip to install pdfplumber, and once you've imported it into your script, the first thing you want to do is create a variable to open the PDF:
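If you haven't installed the library yet, a typical setup looks like this (assuming pip is available on your system):

```shell
# install pdfplumber into your current Python environment
pip install pdfplumber
```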
pdf = pdfplumber.open("tdostats.pdf")
Note that the PDF needs an actual text layer for this to work, so a scanned document will need to be OCR'd first. From there, you'll work with the "pages" property, which is a list of pdfplumber Page objects. For example, I created a list of the page indices I wanted (note that pdf.pages is zero-indexed, like any Python list) and used a list comprehension to pull out the corresponding Page objects:
select_pages = [1, 2, 3, 4, 5, 7, 10, 13]
doc_pages = [pdf.pages[i] for i in select_pages]
From there, you can use the extract_table method to pull data from the table, which comes back as a list of rows. If there's more than one table on a page you want to extract, use the extract_tables method instead, which returns a list of tables. Refer to the pdfplumber documentation for more info.
In my case, I wanted to extract the month (and year) as well as a "total events" value from the table on each page, and because the tops of the tables were messy, I decided to count backwards from the end with negative indices. Because each table covered one fiscal year (July-December of one year, then January-June of the next), I split each table in two to make things easier:
for page in doc_pages:
    table = page.extract_table()
    table_first_half = table[-13:-7]
    table_second_half = table[-7:-1]
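To make those negative indices concrete, here's a small sketch using a mock table, i.e. a plain list of rows, which is the shape extract_table returns. The header and month values here are made up for illustration:

```python
# A mock fiscal-year table shaped like the ones in the report:
# messy header rows at the top, one row per month, then a totals row.
mock_table = (
    [["header junk", ""], ["more junk", ""]]
    + [[month, str(100 + i)] for i, month in enumerate(
        ["Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
         "Jan", "Feb", "Mar", "Apr", "May", "Jun"])]
    + [["Total", "1866"]]
)

# Counting backwards skips the messy header rows entirely:
# rows -13 through -8 are July-December, rows -7 through -2 are January-June,
# and -1 (the totals row) is excluded by both slices.
first_half = mock_table[-13:-7]
second_half = mock_table[-7:-1]

print([row[0] for row in first_half])   # ['Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
print([row[0] for row in second_half])  # ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
```

The nice thing about slicing from the end is that it doesn't matter how many junk rows sit above the data, as long as the bottom of the table is consistent.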
That's it! From there, I wrote some additional code to pull the values out of those rows and clean the data before creating the final visualization.
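My actual cleaning code isn't shown here, but a minimal sketch of that step might look like the function below, assuming each extracted row is a pair of strings like ["Jul", "1,204"]. The function name, row shapes, and values are all made up for illustration:

```python
def clean_rows(rows, year):
    """Turn raw [month, total] string rows into (label, count) pairs.

    The comma stripping handles the thousands separators that
    PDF tables often contain.
    """
    cleaned = []
    for month, total in rows:
        label = f"{month.strip()} {year}"
        count = int(total.replace(",", "").strip())
        cleaned.append((label, count))
    return cleaned

# Made-up example rows, shaped like the first half of a fiscal year:
raw = [["Jul ", "1,204"], ["Aug", "987"], ["Sep", "1,150"]]
print(clean_rows(raw, 2021))
# [('Jul 2021', 1204), ('Aug 2021', 987), ('Sep 2021', 1150)]
```

Once the rows are (label, count) pairs like this, feeding them into a plotting library is straightforward.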
I'm by no means an expert coder, very much a beginner, so if there are things I could have done better, let me know. That said, I hope this walkthrough shows that any journalist can use programming to enhance their work, so give it a try if you haven't already!