What do we mean by the term ‘data versioning’?
A version is “a particular form of something differing in certain respects from an earlier form or other forms of the same type of thing”. In the research environment, we often think of versions as they pertain to resources such as manuscripts, software or data. We may regard a new version to be created when there is a change in the structure, contents, or condition of the resource.
In the case of research data, a new version of a dataset may be created when an existing dataset is reprocessed, corrected or appended with additional data. Versioning is one means by which to track changes associated with ‘dynamic’ data that is not static over time.
Why is data versioning important?
Increasingly, researchers are required to cite and identify the exact dataset used as a research input in order to support research reproducibility and trustworthiness. This means the researcher needs to be able to accurately indicate exactly which version of a dataset underpins their research findings. This becomes particularly challenging where the data to be cited are ‘dynamic’ - for example, a subset of a large dataset accessed via a web service.
Typical scenarios, such as a dataset being corrected or appended after publication, or a subset being drawn from a continually updated source, demonstrate why data versioning is important. They highlight the value of versioning data with a version indicator and an accessible version history. Versioning supports specificity and verifiability, and enables a particular version of a data collection to be uniquely referenced.
This concept is summarised well in the W3C Data on the Web Best Practices guide:
“Version information makes a revision of a dataset uniquely identifiable. Uniqueness can be used by data consumers to determine whether and how data has changed over time and to determine specifically which version of a dataset they are working with. Good data versioning enables consumers to understand if a newer version of a dataset is available. Explicit versioning allows for repeatability in research, enables comparisons, and prevents confusion. Using unique version numbers that follow a standardized approach can also set consumer expectations about how the versions differ. Intended outcome: Humans and software agents will easily be able to determine which version of a dataset they are working with.”
What’s the problem?
There is currently no agreed standard or recommendation among data communities as to why, how and when data should be versioned. Some data providers may not retain a history of changes to a dataset, opting to make only the most recent version available. Other data providers have documented data versioning policies or guidelines based on their own discipline’s practice, which may not be applicable to other disciplines.
There is currently a discussion in the global community as to the need for, or indeed possibility of, an agreed best practice for data versioning across data communities. In the meantime, a variety of data versioning practices and guidelines are described below.
Numbering of data versions is a relatively well-established, yet inconsistently applied, practice. A consistent version numbering scheme enables data users to:
- track whether a collection has changed and if a new version is available
- determine specifically which version they used before and which version they are working with now
- set expectations about how each version will differ.
Software development has established rules for numbering each new release of a software product. The most commonly adopted approach is the three-part semantic versioning convention Major.Minor.Patch (e.g. 2.1.5): an increment to Major indicates changes that are incompatible with the previous version, Minor indicates new functionality added in a backwards-compatible manner, and Patch indicates backwards-compatible bug fixes.
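As a sketch of how such a scheme behaves when applied to dataset releases, the Python snippet below parses hypothetical version strings and compares them; the `Version` class and its method names are illustrative, not part of any standard library or tool:

```python
# Minimal semantic-version handling for dataset releases (illustrative only).
from typing import NamedTuple

class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def is_breaking_change_from(self, other: "Version") -> bool:
        # By convention, a change in the Major component signals incompatibility.
        return self.major != other.major

old = Version.parse("2.1.5")
new = Version.parse("3.0.0")
print(new > old)                         # tuple comparison: True
print(new.is_breaking_change_from(old))  # True
```

Because `Version` is a named tuple, ordinary tuple comparison gives the correct ordering of releases for free.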
Unlike the software domain, the data community doesn’t yet have a standard numbering system. Three representative data version numbering patterns in use include:
What tools are available for data versioning?
There is no one-size-fits-all solution for data versioning and tracking changes. Data come in different forms and are managed with different tools and methods. In principle, data managers should take advantage of data management tools that support versioning and change tracking.
Example approaches include:
Git (and GitHub) for data (suited to datasets under about 10 MB or 100,000 rows), which allows:
- effective distributed collaboration – you can take my dataset, make changes, and share those back with me (and different people can do this at once)
- provenance tracking (i.e. what changes came from where)
- sharing of updates and synchronizing datasets in a simple, effective way.
- Users of ArcGIS can create a geodatabase version, derived from an existing version. When you create a version, you specify its name, an optional description, and the level of access other users have to the version. As the owner of the version, you can change these properties or delete a version at any time.
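As an illustrative sketch of the Git-for-data approach above, the script below records each revision of a small CSV file as a commit. The repository path, file name, and commit messages are hypothetical, and `git` is assumed to be installed:

```python
# Sketch: tracking revisions of a small CSV dataset with Git.
import subprocess
from pathlib import Path

def commit_dataset(repo: Path, data_file: str, message: str) -> None:
    """Stage and commit one revision of the dataset."""
    subprocess.run(["git", "-C", str(repo), "add", data_file], check=True)
    subprocess.run(["git", "-C", str(repo), "commit", "-m", message], check=True)

repo = Path("rainfall-data")  # hypothetical repository directory
repo.mkdir(exist_ok=True)
subprocess.run(["git", "init"], cwd=repo, check=True)
# Ensure a commit identity exists (required on a freshly configured machine).
subprocess.run(["git", "-C", str(repo), "config", "user.email", "you@example.org"], check=True)
subprocess.run(["git", "-C", str(repo), "config", "user.name", "Example User"], check=True)

(repo / "observations.csv").write_text("station,date,mm\nA1,2024-01-01,12.5\n")
commit_dataset(repo, "observations.csv", "v1.0.0: initial release")

# A correction or addition becomes a new, trackable version.
with open(repo / "observations.csv", "a") as f:
    f.write("A1,2024-01-02,3.0\n")
commit_dataset(repo, "observations.csv", "v1.1.0: append 2024-01-02 observation")
```

Every commit is then an independently citable state of the dataset, and `git log` gives the version history and provenance of each change.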
Citation of versioned data
There is no one way to cite versioned data. The form of citation statement will depend on a number of factors including publisher instructions, research domain and type of data. Citations to revisable datasets are likely to include version numbers or access dates.
DataCite recommends this format for citing data with a version number:
Creator(s) (Publication Year): Title. Version. Publisher. Identifier.
- Bradford, Matt; Murphy, Helen; Ford, Andrew; Hogan, Dominic; Metcalfe, Dan (2014): CSIRO Permanent Rainforest Plots of North Queensland. v2. CSIRO. http://doi.org/10.4225/08/53C4CC1D94DA0
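To show how the DataCite elements fit together, the sketch below assembles a citation string from its parts. The helper function and its parameter names are illustrative, not a DataCite API:

```python
def format_datacite_citation(creators, year, title, version, publisher, identifier):
    """Assemble a citation following the DataCite pattern:
    Creator(s) (PublicationYear): Title. Version. Publisher. Identifier."""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {version}. {publisher}. {identifier}"

citation = format_datacite_citation(
    creators=["Bradford, Matt", "Murphy, Helen"],
    year=2014,
    title="CSIRO Permanent Rainforest Plots of North Queensland",
    version="v2",
    publisher="CSIRO",
    identifier="http://doi.org/10.4225/08/53C4CC1D94DA0",
)
print(citation)
```

Keeping the version as an explicit element, rather than folding it into the title, makes it easy for both readers and software to see exactly which revision of the dataset was used.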
DOIs for versioned data
There is no single model for assigning DOIs to versioned data. The Digital Curation Centre recommends that “data repositories should ensure that different versions are independently citable (with their own identifiers).” However, this recommendation may not be applicable across all data types and domains. For example, the Federation of Earth Science Information Partners (ESIP) specifies that a new DOI should be minted for each ‘major’, but not each ‘minor’, version. Some common practical approaches are provided below.
Note that an identifier such as a DOI is used not only by human users to track a record or a dataset, but also by software agents to request one.
If a data provider implements a data access API that uses the DOI as a record or dataset locator, the DOI practices in Examples 2 and 3 are more machine-friendly than that in Example 1.
- Ball, A. & Duke, M. (2015), ‘How to Cite Datasets and Link to Publications’, Digital Curation Centre
- Federation of Earth Science Information Partners (2012), Interagency Data Stewardship/Citations/Provider Guidelines
- NRC-CISTI (n.d.), Datasets and DOIs: Guidelines from DataCite Canada
- Stanford University Libraries (n.d.), Data Versioning
- Starr, J. & Gastl, A. (2011), ‘isCitedby: A Metadata Scheme for DataCite’, D-Lib Magazine. doi:10.1045/january2011-starr