What is data provenance?
Data provenance is the documentation of where a piece of data comes from and the processes and methodology by which it was produced.
The word provenance originates from the French term 'provenir', meaning 'to come from'; provenance is also known as 'lineage' or 'pedigree'. As a practice, provenance has been used in art history to document the history of an artwork, and in digital libraries to document a digital object's lifecycle. Similarly, recording data provenance, a type of metadata, is important for confirming the authenticity of data and enabling its reuse.
The W3C Provenance Incubator Group defines the provenance of a resource as:
"a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance."
This definition states that provenance is associated not just with a data product's history in time, but also with the relationships between a data product and other entities that enable the creation of the data.
Put simply, provenance answers the questions of why and how the data was produced, as well as where, when and by whom.
Metadata that records provenance information in a record from Research Data Australia (highlighted in red)
Why we need provenance
The above definition makes it clear that provenance is about trust, credibility and reproducibility. In data-intensive research, the data users are not likely to be the data producers. Data producers may configure an instrument or simulation in a certain way to collect primary data, or apply certain methodologies and processes to extract, transform and analyse input data to produce an output data product.
The provision of provenance metadata as part of the published data is important for determining the quality of the data, the amount of trust one can place in the results, the reproducibility of the results and the reusability of the data.
For data users, the scientific basis of their analysis and the accountability of their research rely largely on the credibility and trustworthiness of their input data, so they may want to check data quality along with the expected level of imprecision.
How to record and manage provenance
Provenance is recorded as a type of metadata about the data product; many metadata fields routinely collected fall into the category of provenance information, e.g. date created, creator, instrument or software used, data processing methods, etc. Thus, good data management forms the basis of accurately recording provenance.
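As a minimal sketch of the fields listed above, the following Python snippet collects them into a small record that could be stored alongside a data product. The function name, field names and sample values are hypothetical, and the checksum is an added assumption (not part of the guidance above) that ties the record to one exact version of the data.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_provenance_record(data, creator, software, methods):
    """Collect basic provenance fields for a data product held in memory (as bytes)."""
    return {
        "date_created": datetime.now(timezone.utc).isoformat(),
        "creator": creator,
        "software": software,
        "processing_methods": list(methods),
        # Assumption: a checksum links this record to one exact version of the data.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


record = build_provenance_record(
    data=b"station,temp_c\nA1,21.4\n",          # hypothetical data product
    creator="Jane Researcher",                   # hypothetical creator
    software="clean_temps.py v0.1",              # hypothetical software name
    methods=["removed duplicate rows", "converted Fahrenheit to Celsius"],
)

# Serialise the record so it can be published with the data.
print(json.dumps(record, indent=2))
```

A record like this is only as good as the data management behind it: the fields have to be filled in at the time the data is created or processed.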
Approaches to capture and represent provenance can be described on a number of dimensions:
- recorded as a free-text string, using a generic or discipline-specific schema, or using a provenance data model
- captured internally within a software tool or program; or in an external system
- represented in machine readable and/or human readable form.
In its simplest form, provenance can be recorded in a single README text file that describes the data collection and processing methods used. Provenance can also be recorded in a more structured way using specific elements in metadata standards, ranging from very generic standards such as Dublin Core to discipline-specific standards such as ISO 19115-2. Alternatively, provenance information can be described directly in the W3C Provenance Data Model (PROV-DM) and Provenance Ontology (PROV-O). Provenance information captured in Dublin Core and domain-specific schemas can be mapped to a PROV-O representation, so that provenance can be viewed both at the domain-specific level and at the more abstract PROV-O level.
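The mapping idea can be illustrated with a short Python sketch. The three-entry mapping table is a deliberately simplified assumption, loosely based on the W3C note on mapping Dublin Core to PROV; it is not a complete or authoritative mapping. The `ex:` identifiers and the function name are hypothetical, and the triples are plain strings rather than real RDF.

```python
# Simplified, illustrative mapping from Dublin Core terms to PROV-O properties.
DC_TO_PROV = {
    "dcterms:creator": "prov:wasAttributedTo",
    "dcterms:created": "prov:generatedAtTime",
    "dcterms:source": "prov:wasDerivedFrom",
}


def to_prov_statements(subject, dc_record):
    """Return (subject, predicate, object) triples at both the Dublin Core and PROV-O level."""
    triples = []
    for dc_term, value in dc_record.items():
        # Keep the original domain-level statement.
        triples.append((subject, dc_term, value))
        # Add the more abstract PROV-O statement when a mapping is known.
        prov_term = DC_TO_PROV.get(dc_term)
        if prov_term is not None:
            triples.append((subject, prov_term, value))
    return triples


dc_record = {
    "dcterms:title": "Daily temperature observations",  # no PROV mapping in this sketch
    "dcterms:creator": "ex:jane",
    "dcterms:created": "2018-09-18T00:00:00Z",
    "dcterms:source": "ex:raw-observations",
}

triples = to_prov_statements("ex:dataset-1", dc_record)
for triple in triples:
    print(" ".join(triple))
```

Keeping both sets of statements lets a reader work with the familiar Dublin Core view, while tools that only understand PROV-O can still follow the lineage.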
Simplified diagram of provenance modelled in W3C PROV-O
Provenance trails can be captured internally by software tools during their processing activity, for example by workflow systems such as Kepler, Galaxy or Taverna. The provenance information is typically only available to other users of the same system, although it can sometimes be exported to a separate provenance store. Systems that adopt the internal approach tend to capture provenance in proprietary ways. Systems that adopt an external approach often use a standard such as W3C PROV-O because they need to interact with many different kinds of systems.
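To make the idea of internal capture concrete, here is a minimal Python sketch in which each processing step records its own provenance as it runs. It does not reflect how Kepler, Galaxy or Taverna work internally; the decorator, the in-memory `PROVENANCE_LOG` (standing in for a provenance store) and the two processing functions are all hypothetical.

```python
import functools
from datetime import datetime, timezone

PROVENANCE_LOG = []  # hypothetical stand-in for a provenance store


def record_provenance(func):
    """Wrap a processing step so that every call appends an entry to the log."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "inputs": {"args": repr(args), "kwargs": repr(kwargs)},
            "output": repr(result),
            "started_at": started,
            "ended_at": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper


@record_provenance
def to_celsius(fahrenheit_values):
    return [round((f - 32) * 5 / 9, 1) for f in fahrenheit_values]


@record_provenance
def mean(values):
    return sum(values) / len(values)


celsius = to_celsius([68.0, 71.6])
average = mean(celsius)

# The log now holds one entry per processing step, in execution order.
for entry in PROVENANCE_LOG:
    print(entry["activity"], entry["inputs"]["args"], "->", entry["output"])
```

Because the log format here is ad hoc, it illustrates the trade-off described above: the trail is easy to capture inside one tool, but other systems can only use it if it is exported in a shared representation such as W3C PROV-O.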
Finally, provenance information can be captured in a way that supports machine-to-machine interactions (for instance, to allow resources to be identified and located, and workflows to be rerun) and/or at a higher level that allows human users to more easily read the provenance trail of a data product or a data processing workflow. In some cases, the human-readable form might just be a textual description, but it might also involve a visualisation of the machine-readable representation, as provided by tools such as VisTrails.
Provenance panel webinar (18 September 2018)
Matt Miles from the Department of Environment and Water, SA and Karl Monnik from the Bureau of Meteorology talk provenance with Tom Honeyman from ARDC. They discuss how their institutions are capturing and managing provenance information, the benefits, and the challenges faced.
Where to get further provenance information
The W3C Provenance Working Group has published a family of specifications, including: PROV Primer, PROV Ontology (PROV-O), PROV Data Model (PROV-DM), PROV Notation (PROV-N), PROV Constraints, and PROV Access and Query (PROV-AQ).
The International Provenance and Annotation Workshop (IPAW) is a biennial workshop that is concerned with issues of data provenance, data derivation and data annotation. It brings together computer scientists from different areas and provenance users to discuss open problems related to the provenance of computational and non-computational artefacts.
The IPAW website has links to workshops held in the past. You can access workshop papers and presentation slides.
Provenance and social science data webinar (15 March 2017)
Provenance and social science data (George Alter) (65 sec)
Introduction to PROV (Nick Car) (40 sec)
Managing provenance in the Social Sciences: The DDI initiative (Steve McEachern) (36 sec)
Provenance and social science data (full webinar: 54 min)
Are you interested in data provenance?
ANDS facilitates informal monthly meetings for the Australian Research Data Provenance (RDP) Interest Group. The group is open to anybody interested in data provenance.