Thing 21: Tools of the (dirty data) trade
How often have you found data that look interesting, but they're locked up in a PDF or on a webpage? How do you get the data into a spreadsheet so you can work with them?
The School of Data has fantastic, easy-to-follow tutorials that work with real data.
Let’s start by extracting tabular data from text-based PDFs. The Extracting Data From PDFs module provides a brief overview of the different techniques used to extract data from PDFs, with a focus on introducing Tabula, a free, open-source tool built for this specific task.
- Get ready: go to Extracting Data From PDFs
- Download the correct version of Tabula for your operating system, plus the Java runtime if required
- Note: this tutorial doesn’t work on scanned PDFs – Tabula needs text-based PDFs
- Work through as much of the Tabula tutorial as you can and remember this tutorial for the next time you get a PDF with valuable (and hard-to-extract) data!
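Even after Tabula exports a table to CSV, extraction sometimes drops or merges cells, so it pays to sanity-check the result before analysing it. A minimal stdlib sketch (the sample rows are invented, not from any real Tabula export) that flags rows whose column count doesn’t match the header:

```python
import csv
import io

# Invented sample: what a Tabula CSV export might look like,
# including one ragged row where extraction went wrong.
exported = """Country,Year,Value
Ghana,2012,1423
Kenya,2012
Togo,2012,981
"""

rows = list(csv.reader(io.StringIO(exported)))
header, body = rows[0], rows[1:]

# Flag any row whose column count doesn't match the header.
bad = [r for r in body if len(r) != len(header)]
for r in bad:
    print("suspect row:", r)
```

In practice you would read the exported file with `open(...)` instead of the embedded string; the check itself stays the same.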
Option 2: As much as we wish everything were available in CSV or the format of our choice, most data on the web are published in other forms. How do you extract data from HTML? Use a scraper!
- Go to Making data on the web useful: scraping and follow the two ‘recipes’ to learn code-free scraping in 5–10 minutes using Google Spreadsheets and Google Chrome
(Note: use the Google Chrome extension “Scraper, by dvhtn”)
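If you do want to see what a scraper does under the hood, the same “pull the cells out of an HTML table” idea can be sketched in a few lines of Python using only the standard library (the sample page below is invented for illustration):

```python
from html.parser import HTMLParser

# Invented sample page: a tiny HTML table like the ones the
# scraping recipes target.
PAGE = """
<html><body>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Lagos</td><td>15,388,000</td></tr>
  <tr><td>Accra</td><td>2,514,000</td></tr>
</table>
</body></html>
"""

class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # completed rows
        self._row = []       # cells of the row being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)
```

For a live page you would fetch the HTML first (e.g. with `urllib.request`) and feed it to the parser the same way; the Google Spreadsheets recipe does all of this for you behind a single formula.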
If you have time or just love data dabbling:
Extracting data from PDFs will inevitably let some dirty data creep into your dataset. The School of Data has some really interesting data-cleansing modules. Also consider: strategies for encouraging data to be published in more re-usable formats than PDF.