ANDS is facilitating an (informal) monthly meeting for the Australian Research Data Provenance (RDP) Interest Group. The goals of the group are to bring together people who work in the provenance space, facilitate conversation, discuss a range of provenance issues, reduce duplication of effort, and coordinate community activities.
What is data provenance?
Data provenance documents where a piece of data comes from and the process and methodology by which it was produced.
The word provenance originates from the French term 'provenir', meaning 'to come from', and is also known as 'lineage' or 'pedigree'. Provenance, as a practice, has been used in art history to document the history of an artwork and in digital libraries to document a digital object's life cycle. In a similar way, data provenance, a kind of metadata, is important for confirming the authenticity of data. This is becoming increasingly important, especially in the eScience community, where research is data-intensive and often involves complex data transformations and procedures.
The W3C Provenance Incubator Group defines provenance of a resource as:
"a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance."
Under this definition, provenance covers not just a data product's history back in time, but also the relationships between the data product and the other entities that enabled its creation.
Why we need provenance
The above definition makes it clear that the whole idea of provenance is about trust, credibility and reproducibility. In data-intensive science, the data users are not likely to be the data producers. Data producers may configure an instrument or simulation in a certain way to collect primary data, or apply certain methodologies and processes to extract, transform and analyse input data to produce output data.
Providing provenance as part of published data (either primary or secondary) is therefore important for determining the quality of the data, the amount of trust one can place in the results, the reproducibility of those results and the reusability of the data.
For data users, the scientific basis of their analysis and the accountability of their research rely largely on the credibility and trustworthiness of their input data, so they may want to check data quality along with the expected level of imprecision.
How to manage provenance records
What a provenance record should include, at what granularity, and how provenance records should be managed and disseminated remain domain-specific questions. In any case, implementers of data provenance systems should consult users' requirements to ensure that the provenance records and the system can be used to answer users' questions.
The W3C Provenance Working Group has published a series of recommendations covering a provenance data model, ontology, vocabulary and representation schema, which can be used as guidelines.
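As a rough illustration of what a provenance record might capture, the sketch below stores which process used which inputs to generate which outputs, and then traces the lineage of a derived data product. It is only loosely inspired by the entity and activity concepts of the W3C PROV data model and does not implement any PROV specification. The class name ProvenanceStore, its methods, and the file and process names are all hypothetical, chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory provenance store (illustration only)."""

    # process name -> set of input data products it used
    used: dict = field(default_factory=dict)
    # output data product -> name of the process that generated it
    generated_by: dict = field(default_factory=dict)

    def record(self, process, inputs, outputs):
        """Record that `process` used `inputs` to generate `outputs`."""
        self.used.setdefault(process, set()).update(inputs)
        for output in outputs:
            self.generated_by[output] = process

    def lineage(self, product):
        """Return every data product that `product` was derived from,
        directly or indirectly."""
        seen = set()
        stack = [product]
        while stack:
            current = stack.pop()
            process = self.generated_by.get(current)
            if process is None:
                # Primary data: no recorded process generated it.
                continue
            for source in self.used.get(process, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# Assumed example: two processing steps with made-up file names.
store = ProvenanceStore()
store.record("calibrate",
             inputs=["raw_readings.csv"],
             outputs=["calibrated.csv"])
store.record("aggregate",
             inputs=["calibrated.csv", "site_metadata.csv"],
             outputs=["monthly_means.csv"])

print(sorted(store.lineage("monthly_means.csv")))
# ['calibrated.csv', 'raw_readings.csv', 'site_metadata.csv']
```

A data user could run a query like this to see which primary data and intermediate products a published result depends on. A real system would also need to decide, per domain, what granularity to record and which additional details (people, instruments, software versions, timestamps) users need.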
Where to get further provenance information
The W3C Provenance Working Group has published a family of PROV documents, including: PROV Primer, PROV Ontology (PROV-O), PROV Data Model (PROV-DM), PROV Notation (PROV-N), PROV Constraints, and PROV Access and Query (PROV-AQ).
The International Provenance and Annotation Workshop (IPAW) is a biennial workshop that is concerned with issues of data provenance, data derivation, and data annotation. It brings together computer scientists from different areas and provenance users to discuss open problems related to the provenance of computational and non-computational artefacts.
The IPAW website has links to workshops held in the past, from where you can access workshop papers and presentation slides.
Provenance and social science data webinar (15 March 2017)
Snippet: 1:05 - Provenance and social science data (George Alter)
Snippet: 0:40 - Introduction to PROV (Nick Car)
Snippet: 0:36 - Managing provenance in the Social Sciences: The DDI initiative (Steve McEachern)
Watch the full webinar recording: Provenance and social science data (YouTube, 54:22)
How to get involved
The group is open to anybody interested in data provenance.
If you would like to join the group, contact firstname.lastname@example.org.