What do we mean by the term ‘data versioning’?
A version is "a particular form of something differing in certain respects from an earlier form or other forms of the same type of thing". In the research environment, we often think of versions as they pertain to resources such as manuscripts, software or data. We may regard a new version as having been created when there is a change in the structure, contents, or condition of the resource.
In the case of research data, a new version of a dataset may be created when an existing dataset is reprocessed, corrected or appended with additional data. Versioning is one means by which to track changes associated with ‘dynamic’ data that is not static over time.
Why is data versioning important?
Increasingly, researchers are required to cite and identify the exact dataset used as a research input in order to support research reproducibility and trustworthiness. This means the researcher needs to be able to accurately indicate exactly which version of a dataset underpins their research findings. This becomes particularly challenging where the data to be cited are ‘dynamic’ - for example, a subset of a large dataset accessed via a web service.
These two typical scenarios demonstrate why data versioning is important:
A researcher has used a data collection to verify a research hypothesis, and has published the research result. The researcher should cite exactly the version of the data collection used to enable other researchers to verify the research result and/or do comparison studies.
A researcher has submitted a manuscript to a journal, and was subsequently asked to re-run an experiment during the reviewing process. A different result was obtained after re-running the experiment. After checking models and software, the researcher suspects the data might have changed. The researcher therefore needs to obtain the version used earlier to produce the same result, or to understand the differences between versions in order to make a judgment about whether changes in the data would affect the research output, or even the research hypothesis.
These scenarios highlight the importance of versioning data with a versioning indicator and history available. Versioning supports specificity and verifiability and enables a particular version of a data collection to be uniquely referenced.
This concept is summarised well in the W3C Data on the Web Best Practices guide:
“Version information makes a revision of a dataset uniquely identifiable. Uniqueness can be used by data consumers to determine whether and how data has changed over time and to determine specifically which version of a dataset they are working with. Good data versioning enables consumers to understand if a newer version of a dataset is available. Explicit versioning allows for repeatability in research, enables comparisons, and prevents confusion. Using unique version numbers that follow a standardized approach can also set consumer expectations about how the versions differ. Intended outcome: Humans and software agents will easily be able to determine which version of a dataset they are working with.”
What’s the problem?
There is currently no agreed standard or recommendation among data communities as to why, how and when data should be versioned. Some data providers may not retain a history of changes to a dataset, opting to make only the most recent version available. Other data providers have documented data versioning policies or guidelines based on their own discipline’s practice, which may not be applicable to other disciplines.
There is currently a discussion in the global community as to the need for, or indeed possibility of, an agreed best practice for data versioning across data communities. In the meantime, a variety of data versioning practices and guidelines are described below.
Numbering of data versions is a relatively well established, yet inconsistently used, practice. A consistent version numbering scheme enables data users to:
- track whether a collection has changed and if a new version is available
- determine specifically which version they used before and which version they are working with now
- set expectations about how each version would differ.
Software development has established rules for numbering each new software release. The most commonly adopted approach is the three-part semantic versioning convention: Major.Minor.Patch (e.g. 2.1.5), where Major indicates changes that are incompatible with the previous version, Minor indicates new functionality added in a backwards-compatible manner, and Patch indicates backwards-compatible bug fixes.
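As a sketch of how such numbers behave, the snippet below parses version strings into integer tuples so they compare numerically rather than alphabetically; the helper name is illustrative, not part of any standard library.

```python
# Minimal sketch: compare Major.Minor.Patch semantic version strings.
def parse_semver(version: str) -> tuple:
    """Split a Major.Minor.Patch string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))

# Tuples compare element by element, so 2.9.0 correctly sorts before
# 2.10.0, whereas a plain string comparison would order them wrongly.
print(parse_semver("2.9.0") < parse_semver("2.10.0"))  # True
print("2.9.0" < "2.10.0")                              # False (string order)
```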
Unlike the software domain, the data community doesn’t yet have a standard numbering system. Three representative data version numbering patterns in use include:
Numbering system 1
Data versioning follows a similar path to software versioning, usually applying a two-part numbering rule: Major.Minor (e.g. V2.1). A Major data revision indicates a change in the formation and/or content of the dataset that may alter its scope, context or intended use. For example, a major revision may increase or decrease the statistical power of a collection, require changes to data access interfaces, or enable or prevent the answering of particular research questions. A Major revision may incorporate:
- substantial new data items added to or deleted from a collection
- data values changed because of temporal and/or spatial baseline changes
- additional data attributes introduced
- changes in a data generation model
- the format of data items changed
- major changes in upstream datasets.
Minor revisions often involve quality improvements to existing data items. These changes may not affect the scope or intended use of the initial collection. A Minor revision may include:
- renaming of a data attribute
- correction of errors in existing data
- re-running a data generation model with adjusted parameters
- minor changes in upstream datasets.
As version numbers are human generated, what constitutes a ‘major’ or ‘minor’ change is a subjective decision and the above examples may vary from discipline to discipline in practice.
Some data providers may use a three-part numbering system: Major.Minor.Patch. The Patch version is used only for internal tracking and is not exposed externally.
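The rules above could be encoded as a small helper that bumps a two-part data version number. This is a sketch only: the function name and change categories are illustrative, and, as noted, what counts as a 'major' change remains a policy decision.

```python
# Hypothetical helper: bump a two-part data version (Major.Minor)
# given the kind of change described in Numbering system 1.
def bump_version(version: str, change: str) -> str:
    major, minor = (int(p) for p in version.lstrip("vV").split("."))
    if change == "major":    # e.g. substantial data added/deleted, format change
        return f"V{major + 1}.0"
    if change == "minor":    # e.g. error corrections, attribute renamed
        return f"V{major}.{minor + 1}"
    raise ValueError(f"unknown change type: {change}")

print(bump_version("V2.1", "major"))  # V3.0
print(bump_version("V2.1", "minor"))  # V2.2
```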
Numbering system 2
This pattern simply applies a single number to indicate that the data has been revised and versioned (e.g. V1, V2). This approach suits researchers working on their own datasets who do not need to distinguish between major and minor changes.
Numbering system 3
In some disciplines, such as astronomy or marine science, data providers typically process raw data into products at various levels (e.g. Level 0, Level 1, …), with each level introducing an increasing degree of scientific interpretation and/or data quality control.
For example, NASA uses the following categories to indicate how close each data product is to the raw data and which models and/or algorithms have been applied to the raw data or to a previous level:
- Level 0: the 'raw' data from the satellite.
- Level 1: data calibrated and geolocated, keeping the original sampling pattern.
- Level 2: data converted into geophysical parameters, still with the original sampling pattern.
- Level 3: data resampled, averaged over space, and interpolated/averaged over time.
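For illustration only, the level descriptions above can be held in a simple lookup table; the names here paraphrase the categories above and are not drawn from any NASA library.

```python
# Processing-level descriptions, paraphrased from the NASA categories above.
NASA_LEVELS = {
    0: "raw data from the satellite",
    1: "calibrated and geolocated, original sampling pattern",
    2: "converted to geophysical parameters, original sampling pattern",
    3: "resampled, averaged over space, interpolated/averaged over time",
}

def describe_level(level: int) -> str:
    """Return a short description of a processing level."""
    return NASA_LEVELS.get(level, "unknown processing level")

print(describe_level(0))  # raw data from the satellite
```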
IMOS netCDF Conversion combines the levels of quality control and the levels of scientific interpretation (see reference table 4, page 77).
In principle, each change to a collection, whether major or minor, should be tracked and recorded to ensure the dataset is trustworthy and reproducible. The timing for making and releasing a new version (major or minor) is usually determined by a business process (e.g. continuous or periodic), magnitude of changes (e.g. insignificant changes can be accumulated until release of a version with minor changes) or by special request.
Ideally, if a collection is revised and its earlier versions have been used and cited, those data users should be notified in case the changes (e.g. bugs in model) affect research outcomes (either in published research papers or in policy statements).
What tools are available for data versioning?
There is no one-size-fits-all solution for data versioning and change tracking. Data come in different forms and are managed with different tools and methods. In principle, data managers should take advantage of data management tools that support versioning and change tracking.
Example approaches include:
Git (and GitHub) for data (with size <10 MB or 100k rows), which allows:
- effective distributed collaboration – you can take my dataset, make changes, and share those back with me (and different people can do this at once)
- provenance tracking (i.e. what changes came from where)
- sharing of updates and synchronising datasets in a simple, effective way.
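The change-tracking idea behind Git for data can be sketched with Python's standard difflib module, which produces a git-style unified diff of two versions of a small tabular dataset; the file names and rows here are made up for illustration.

```python
import difflib

# Two versions of a small CSV dataset, as lists of rows (illustrative data).
v1 = ["id,value", "1,42", "2,17"]
v2 = ["id,value", "1,42", "2,18", "3,99"]

# unified_diff shows exactly which rows changed between versions.
diff = list(difflib.unified_diff(v1, v2, fromfile="data_v1.csv",
                                 tofile="data_v2.csv", lineterm=""))
for line in diff:
    print(line)
```

Rows prefixed with `-` existed only in the earlier version and rows prefixed with `+` only in the later one, which is precisely the provenance information a versioning tool records.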
Users of ArcGIS can create a geodatabase version derived from an existing version. When you create a version, you specify its name, an optional description, and the level of access other users have to the version. As the owner of the version, you can change these properties or delete the version at any time.
Citation of versioned data
There is no one way to cite versioned data. The form of citation statement will depend on a number of factors including publisher instructions, research domain and type of data. Citations to revisable datasets are likely to include version numbers or access dates.
DataCite recommends this format for citing data with a version number:
Creator(s) (Publication Year): Title. Version. Publisher. Identifier.
Bradford, Matt; Murphy, Helen; Ford, Andrew; Hogan, Dominic; Metcalfe, Dan (2014): CSIRO Permanent Rainforest Plots of North Queensland. v2. CSIRO. http://doi.org/10.4225/08/53C4CC1D94DA0
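A minimal sketch of assembling a citation string in this format from metadata fields; the helper function is hypothetical, and the field values are taken from the example above.

```python
# Hypothetical helper: build a DataCite-style citation string
# (format: Creator(s) (PublicationYear): Title. Version. Publisher. Identifier.)
def format_citation(creators, year, title, version, publisher, identifier):
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {version}. {publisher}. {identifier}"

citation = format_citation(
    ["Bradford, Matt", "Murphy, Helen"],
    2014,
    "CSIRO Permanent Rainforest Plots of North Queensland",
    "v2",
    "CSIRO",
    "http://doi.org/10.4225/08/53C4CC1D94DA0",
)
print(citation)
```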
DOIs for versioned data
There is no one model for assigning DOIs to versioned data. The Digital Curation Centre recommends that "data repositories should ensure that different versions are independently citable (with their own identifiers)." However, this recommendation may not be applicable across all data types and domains. For example, the Federation of Earth Science Information Partners (ESIP) specifies that a new DOI should be minted for each 'major' but not 'minor' version. Some common practical approaches are provided below.
Example 1
A paper and its associated data collection have been published. Later, the collection is expanded and new research is done based on the expanded collection. In this case, the expanded collection can be regarded as a new version of the original collection and assigned a version number. The new version and the original version can share the same DOI, so that a researcher can find the exact version used for the first paper but can also see the latest version.
Garlick, Cathy, 2015, "CCAFS Household Baseline Study, Latin America & South East Asia (2014-2015)", doi:10.7910/DVN/PWVLTU, Harvard Dataverse, V1
Garlick, Cathy, 2015, "CCAFS Household Baseline Study, Latin America & South East Asia (2014-2015)", doi:10.7910/DVN/PWVLTU, Harvard Dataverse, V2
Note that the two citations above have the same DOI and therefore the same landing page but each citation specifies a specific version. If a user follows the DOI link, the most current version is returned.
Example 2
This scenario is similar to Example 1 above, but in this case the DOI itself includes a version number, for example:
Each version has its own landing page; however, if the version number is omitted (doi:10.5061/dryad.j4315), the most current version is returned.
Note that not all DOI allocation agencies support minting a DOI with a version number.
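The difference between the shared-DOI and version-suffixed approaches can be sketched as follows. The suffix scheme shown is an assumption for illustration only, not a rule of any DOI agency; the base DOI is taken from the Dryad example above.

```python
# Hypothetical helper: address a specific version via a DOI suffix.
# With a shared DOI (Example 1) a client cannot request a version directly;
# with a version-suffixed DOI (Example 2) it can.
def versioned_doi(base_doi, version=None):
    """Return the base DOI, or a version-suffixed DOI if a version is given."""
    return base_doi if version is None else f"{base_doi}.{version}"

print(versioned_doi("doi:10.5061/dryad.j4315"))     # latest version
print(versioned_doi("doi:10.5061/dryad.j4315", 2))  # a specific version
```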
Example 3
When a new version of a dataset is published, a new metadata landing page is created and a new DOI is minted. In this example, taken from the CSIRO Data Access Portal (DAP), the DOI for the previous version resolves to the metadata landing page for that version.
However, a message displayed in the record advises that a newer version is available, with a link to the more recent version. A search in DAP or Research Data Australia will retrieve only the latest version.
Click on the DOIs in the citation statements below to view an example.
Harwood, Tom; Williams, Kristen; Ferrier, Simon; Ota, Noboru; Perry, Justin; Langston, Art; Storey, Randal (2014): 9-second gridded continental Australia change in effective area of similar ecological environments (cleared natural areas) for Amphibians 1990:1990 (GDM: AMP_r2_PTS1). v1. CSIRO. Data Collection. http://doi.org/10.4225/08/54815C68BEF05
Harwood, Tom; Williams, Kristen; Ferrier, Simon; Ota, Noboru; Perry, Justin; Langston, Art; Storey, Randal (2014): 9-second gridded continental Australia change in effective area of similar ecological environments (cleared natural areas) for Amphibians 1990:1990 (GDM: AMP_r2_PTS1). v2. CSIRO. Data Collection. http://doi.org/10.4225/08/5486764AD2F64
Note that an identifier such as DOI is not only used by human users to track a record or a dataset, but also by software agents to request a record or a dataset.
If a data provider implements a data access API using the DOI as a record/dataset locator, the DOI practices in Examples 2 and 3 are more machine friendly than that in Example 1.
- Ball, A. & Duke, M. (2015). 'How to Cite Datasets and Link to Publications'. DCC How-to Guides. Edinburgh: Digital Curation Centre.
- Federation of Earth Science Information Partners (2012). Interagency Data Stewardship/Citations/Provider Guidelines.
- NRC-CISTI (n.d.). Datasets and DOIs: Guidelines from DataCite Canada.
- Stanford University Libraries (n.d.). Data Versioning.
- Starr, J. & Gastl, A. (2011). 'isCitedby: A Metadata Scheme for DataCite'. D-Lib Magazine, 17(1/2). doi:10.1045/january2011-starr