Skip to content

Search for Research Data

Search the ANDS Site

Search
Search
http://www.ands.org.au http://www.ands.org.au
Share
Share

Thing 21: Tools of the (dirty data) trade

Dig in to dirty data.  What is it?  Why should we care? Try your hand at using an open source data cleansing tool.

  • Getting started: Pregnant men? The data says so! Getting down and dirty with data
  • Learn more: Turn a PDF ‘tabula rasa’ into usable data with Tabula
  • Challenge me: OpenRefine is a powerful tool for cleaning up lots of  dirty data

Getting startedLearn moreChallenge me


Getting started

Thing 21: Tools of the (dirty data) trade

Data horror stories: how did it happen!?

Why is "clean” data important? Public policy, changes to medical protocols and economic decisions all depend on accurate and complete data. Thing 21 looks at the why and what of “dirty data.”

  1. Pregnant men? Yes! Read some real data horror stories about dirty data
  2. How does data get dirty in the first place? This short video (5mins) from the ASPCA (American Society for Prevention of Cruelty to Animals) shows simply how incomplete, inaccurate data can occur. You can imagine how the resultant problems would multiply exponentially the bigger the dataset
  3. Browse down the Bad Data Guide list of commonly encountered data quality issues (with possible solutions). This list is aimed at journalists but it shows who is responsible for cleaning up dirty data.

Click into a few of the causes and solutions to dirty data - many of us contribute information to reports or do our home accounts in spreadsheets, and maybe it’s time to think about how clean our own data is!

If you have time: For a quick guide to working with spreadsheets, check out one of the School of Data’s Data Fundamentals course. The modules use real data from eg the World Bank.

Consider: the wide ranging implications of how dirty data can impact on your life.

Do you have a question?  Want to share a resource?


thing21 credly badge

Completion of Thing 21 means you've earned a digital badge!

Claim your badge

Go to Thing 22 What’s in a name?


Learn more

Thing 21: Tools of the (dirty data) trade

How often have you found some data that looks interesting, but they’re in PDFs or on a webpage… how do you get the data into spreadsheets so you can work with them?

The School of Data has fantastic, easy to follow tutorials working with real data.

Option 1:

Let’s start extracting tabular data from text-based PDFs. The Extracting Data From PDFs module provides a brief overview of the different techniques used to extract data from PDFs, with a focus on introducing Tabula, a free open-source tool build for this specific task.

  1. Get ready: go to Extracting Data From PDFs
    1. Download the correct version of Tabula for your operating system, and java runtime if required
    2. note this tutorial doesn’t work on scanned pdfs
  2. Work through as much of the Tabula tutorial as you can and remember this tutorial for the next time you get a PDF with valuable (and hard-to-extract) data!

Option 2: As much as we wish everything was available in CSV or the format of our choice – most data on the web is published in different forms. How do you extract data from HTML? Use a Scraper!

  1. Got to Making data on the web useful: scraping and follow the two ‘recipes’ to learn code-free Scraping in 5-10 minutes using Google Spreadsheets & Google Chrome

(Note: Use the Google Chrome Extension “Scraper, by dvhtn”)

If you have time or just love data dabbling:

Extracting data from PDFs will inevitably result in some dirty data creeping into your dataset. The School of Data have some really interesting Data Cleansing modules.

Consider: strategies for encouraging data to be published in more re-usable formas rather than PDF.

Do you have a question?  Want to share a resource?


thing21 credly badge

Completion of Thing 21 means you've earned a digital badge!

Claim your badge

Go to Thing 22 What’s in a name?


challenge me

Thing 21: Tools of the (dirty data) trade

OpenRefine (formerly Google Refine) is a valuable open source tool that is similar to Excel but more powerful. You can use it to: record data; manipulate data; clean up dirty data; and to transform datasets.

Option 1: If you are new to OpenRefine

  1. Start by watching introduction to OpenRefine (6 .48mins) to learn how it can be used to clean up messy data.
  2. Now get hands on!  You will need to download OpenRefine and the dataset for this activity. Work your way through as much as this tutorial as you can. The sections covering Facets and Transforming Values will give you a flavour for what the OpenRefine tool offers.

Option 2: if you’re already familiar with OpenRefine and might be ready to share what you know

Take a look through the lesson materials used to teach OpenRefine in either:

  1. OpenRefine in Library Carpentry
  2. OpenRefine in Data Carpentry for Ecology

Consider: Carpentry course Trainers and Helpers are often in high demand. Would you consider becoming a trainer to teach tools such as Open Refine?

Do you have a question?  Want to share a resource?


thing21 credly badge

Completion of Thing 21 means you've earned a digital badge!

Claim your badge

Go to Thing 22 What’s in a name?


in 23 (research data) Things