Why do journals need data policies?
An increasing number of journals are implementing policies and procedures that require published articles to be accompanied by the underlying research data. These policies support initiatives that allow for replication and verification of authors’ published claims.
Journal policies on research data and related materials such as software code are an important part of the shift toward reproducible research. They support, and in some cases have driven, statements, mandates and principles issued by research funders, governments and scientific societies around the world.
ANDS has developed a Research data for journal editors Guide. This guide is intended to provide a starting point for editors considering developing or improving data policies for their journals.
What should a data policy include?
The guide outlines 10 elements to consider in developing or refining a data policy:
Defining data and datasets
Journal data policies vary widely in their definition of terms such as ‘the data’ and ‘the dataset’. Stating the definition of these terms within a data policy will provide clarity and guidance for authors when depositing data associated with article submissions.
Which data to include?
Should it be all data collected or only data relevant to the research reported on in the article?
Transparent and replicable research is a strong motivation behind many data policies. As such, a policy needs to specify what data and related materials should be made available by authors.
Related software, methods and materials
Are related materials required in order for the data to be understood and replicated?
Generally, a policy should specify that the data required is the portion that was used in the research reported in the article and that materials related to the research findings should also be made available. Some data cannot be understood or replicated without software code, methods and other related materials, so the data policy needs to make clear whether these materials must be included. For example, Nature’s policy on the availability of data, materials and methods covers a wide range of research objects such as computer code, experimental protocols, clinical trials and more:
Authors must make available upon request, to editors and reviewers, any previously unreported custom computer code used to generate results that are reported in the paper and central to its main claims. Any practical issues preventing code sharing will be evaluated by the editors who reserve the right to decline the paper if important code is unavailable. Upon publication, Nature Journals consider it best practice to release custom computer code in a way that allows readers to repeat the published results.
Discipline conventions for data deposition
Should it be raw or processed data?
A data policy should also specify whether raw or processed data is required. In some disciplines it may be standard to share processed data, whereas in others the raw data will be required.
The STM Brussels Declaration, for example, states, “raw research data should be made freely available to all researchers”.
PLOS defines the “minimal data set” to consist of the data set used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. Authors do not need to submit their entire data set if only a portion of the data were used in the reported study. Also, authors do not need to submit the raw data collected during an investigation if the standard in the field is to share data that have been processed.
Policies need to specify that both the data and associated metadata required to validate the findings in the article should be deposited. Data without metadata is not discoverable.
When creating the metadata record and making the data available, a data policy may advise authors to adhere to the FAIR principles (Findable, Accessible, Interoperable and Reusable) for data management.
The FAIR principles serve to guide data producers and publishers to overcome obstacles to good data management and stewardship. A number of publishers contributed to the development of the FAIR principles and have subsequently endorsed them and incorporated them into publication workflows.
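As a rough illustration of what checking a deposit's metadata could involve, the sketch below tests a metadata record for a small set of descriptive fields. The field names and the example record are assumptions made for illustration only; neither the FAIR principles nor any journal policy cited here prescribes this particular list.

```python
# Hypothetical minimum set of descriptive fields; this list is an assumption
# for illustration, not something prescribed by FAIR or by a journal.
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
    "license",
)


def missing_metadata_fields(record):
    """Return the required fields that are absent or empty in `record`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "" or value == []:
            missing.append(field)
    return missing


example_record = {
    "identifier": "https://doi.org/10.5555/example",  # placeholder, not a real DOI
    "creators": ["Researcher, A."],
    "title": "Example survey dataset",
    "publisher": "Example Institutional Repository",
    "publication_year": 2017,
    # "license" is deliberately left out to show how a gap is reported
}

print(missing_metadata_fields(example_record))  # prints ['license']
```

A check like this only confirms that fields are present. Whether the metadata is rich enough to make the data findable and reusable still needs human judgement.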
How can researchers make data available if they are not the creators of the data, for example, where secondary data has been used to inform the research?
Not all research projects involve data collection and a data availability policy should take this into account. In the event that authors did not collect data themselves but used existing data, the data policy should specify that they cite the data they used in the same way as they cite publication resources. The citation should include a link to information about the data and where possible, the data itself.
How can authors share data that contains confidential information about survey participants or other sensitive materials?
The rights and privacy of people who participate in research must be protected. Sensitive data are data that can be used to identify an individual, species, object, or location in a way that introduces a risk of discrimination, harm, or unwanted attention. Major, familiar categories of sensitive data are:
- human/medical health and personal data, including information about secret or sacred practices
- ecological data that may place vulnerable species at risk.
The collection of sensitive data does not preclude sharing and the advantages of publishing sensitive data will far outweigh any potential disadvantages when simple and appropriate steps are taken. Sensitive data that has been de-identified can be openly published and shared.
Publishing this data, or a description of the data (a metadata record), means that others can discover it, reuse it and cite it.
A journal data policy may need to refer authors to advice on how to share data that is sensitive. ANDS provides three relevant Guides with links to Australian and international best practice examples that can be included in a journal data policy:
- ANDS Sensitive Data – Publishing and Sharing Guide outlines best practice for the publication and sharing of sensitive research data in the Australian context. A publishing sensitive data decision tree is also available.
- De-identification gives a legal definition of de-identification and collates a selection of Australian and international practical guidelines and resources on how to de-identify datasets. It is intended for those who own a dataset and want to de-identify it for the purpose of sharing or publishing the data.
- Data Sharing Considerations for Human Research Ethics Committees provides an overview of items that members of Human Research Ethics Committees (HRECs) can consider when assessing applications that propose to share data.
In some cases, data may not be made public. In this case, it should still be possible to publish the metadata and thereby enable a link from the publication to a metadata record.
Find out more
- Publishing descriptions of non-public clinical datasets: guidance for researchers, repositories, editors and funding organisations (2016) Iain Hrynaszkiewicz, Varsha Khodiyar, Andrew Hufton, Susanna-Assunta Sansone. bioRxiv 021667 doi.org/10.1186/s41073-016-0015-6
COPDESS (Coalition on Publishing Data in the Earth and Space Sciences) Statement of Commitment (2015) has been signed by various publishers and repositories and makes the case for collaboration between publishers and data facility providers to improve data availability and reuse.
Journal data policies should include clear instructions for authors on where to deposit the data accompanying their article. Policies that refer to depositing data in “an appropriate data repository” are not as useful as those that provide criteria for selecting a suitable repository and/or a list of recommended repositories. This is because authors may be unfamiliar with using a data repository and may have concerns about the trustworthiness of repositories. Scientific Data, for example, provides guidelines and a recommended repositories list.
Additional criteria for selecting an appropriate repository may be included in the policy such as:
- suitability for the discipline
- use of open licenses and inclusion of metadata
- adherence to standards and best practice
- a data preservation policy
- mechanisms for data citation and accessibility.
In advising authors which repositories to choose from when depositing their data, a journal policy should include 'institutional repositories', which refers to repositories that are managed by research institutions such as universities. The benefits of depositing data in an institutional repository are many:
- they have been established by an institution with the aim of making research data and related materials more widely accessible
- they are generally run by library staff who have expertise in identifying, selecting, organising, describing, preserving and providing access to research materials
- assistance for data deposit is generally available as a service and staff can provide advice on aspects of data publishing such as licensing
- issuing of identifiers for data, in particular Digital Object Identifiers (DOIs), is often incorporated into deposit workflows
- data citation mechanisms are supported
- the care and preservation of research materials are part of institutional and library goals.
It is therefore highly recommended that a journal data policy list institutional repositories among its recommended repositories.
The PLOS policy on data availability, for example, includes an option to deposit in institutional repositories when a domain data repository is not available:
If no specialized community-endorsed open repository exists, institutional repositories that use open licenses permitting free and unrestricted use or public domain, and that adhere to best practices pertaining to responsible data sharing, sustainable digital preservation, proper citation, and openness are also suitable for data deposition.
Data facility providers
A number of publishers have partnered with data facility providers to incorporate data deposit into the submission process. Examples include:
- Wiley offers a Data Sharing Service through a partnership with Figshare, enabling authors to easily upload data within the existing manuscript submission workflow.
- Dryad offers integrated data submission for over 100 journal partners and other publications.
- Elsevier offers authors upload to Mendeley Data and a Database Linking Tool to create bidirectional links between data repositories and articles on ScienceDirect.
- Springer offers enhanced display and discovery of supplemental materials in BioMed Central and SpringerOpen journals in partnership with Figshare.
Registry of Research Data Repositories (Re3Data) lists over 1500 research data repositories, making it the largest and most comprehensive registry of data repositories available on the internet.
Authors will benefit from clear instructions on how to deposit the data that accompanies their article. If a data policy specifies that the data be deposited into a data repository, it may direct authors to follow the submission guidelines provided by that repository. If the policy advises data deposit into a publisher-managed repository, it should refer authors to instructions on how to deposit their data and what metadata needs to be provided. For example, PLOS provides information to authors on depositing data, including how to deposit data with a data repository integration partner.
It is recommended that data policies provide details to authors of when they need to deposit the data and associated metadata needed to validate the results presented in their publication. Generally, the data policy should require authors to provide data to the editorial office prior to the publication of an article. The exact timing of deposit will differ depending on whether the data is to be peer reviewed or not.
Data deposition coincides with article submission
Ideally, authors should make the data available with the initial submission of the article. This allows reviewers to assess the data if they wish, even if the data is not public at that time. Data that has been deposited in a public repository at the time of submission can be issued with a Digital Object Identifier (DOI) and a citation for the data. The data citation, which includes the DOI, can be included in the article itself, in the references, data or supplementary material section. This workflow enables a clear link to be made between the article and the underlying data.
Data deposition integrated with article submission
Some journals integrate article submission with data repositories. For example, the Dryad Data Repository offers submission integration as a free service that allows journal publishers to co-ordinate the submission of manuscripts with submission of data to Dryad. It includes three options used by journals for when to submit data associated with an article.
If a publisher decides that supporting data is to be peer-reviewed then the data policy should specify this. Usually review processes and timeframes for data will be in line with, or linked to, conditions for the review of the manuscript itself.
Authors will need to know:
- when the journal requires their data for the purposes of review
- whether the data will remain confidential until the review process is completed and
- what the timeframe is for responding to the review.
Supporting data must be made available to editors and peer-reviewers at the time of submission for the purposes of evaluating the manuscript… Some of these repositories offer authors the option to host data associated with a manuscript confidentially, and provide anonymous access to peer-reviewers before public release. These repositories then coordinate public release of the data with the journal's publication date. This option should be used when possible but it remains the author's responsibility to communicate with the repository to ensure that public release is made on time for online publication of the paper.
Editors have a role to play in recommending that authors do apply a license to their data and in referring authors to information about licensing models. Licensing data is essential so that those accessing the data know exactly what they can and can't do with it. Lack of clarity around use and reuse of data can have the same result as forbidding reuse of the data.
Research conducted by Springer Nature in partnership with Digital Science, published in The State of Open Data report (2016), clearly shows that researchers would benefit from guidance on data licensing. The research found that, of the researchers who had already made their data open, 60% were unsure about the licensing conditions under which they had shared their data, and thus the extent to which it could be accessed or reused.
Creative Commons licensing
Ideally, data will have the least restrictive license, and many datasets are licensed with CC-BY licenses. This is because the more open the license is, and the fewer conditions it carries, the more available the data is for reuse. The ANDS website includes a section on Licensing for Data Reuse that includes links to Creative Commons and a variety of other resources including FAQs and ANDS licensing webinar recordings.
Data citation refers to the practice of providing a reference to data in the same way as authors routinely provide a bibliographic reference to outputs such as journal articles, reports and conference papers. The State of Open Data report (2016) and survey by Digital Science, Figshare and Springer Nature found that 80% of researchers value data citation as much as, or more than, article citation.
Data citation policies
Data Citation is an important component of a Research Data Policy and is included to:
- encourage and support data citation practices so that researchers receive credit and reward for sharing their data;
- provide a persistent and consistent method of linking articles with underlying data;
- facilitate easy access to the data;
- provide readers with a more complete picture of the entire research project.
In drafting a Data Citation Policy, a good resource to use is the Joint Declaration of Data Citation Principles. The Joint Declaration, launched in 2014, is a set of Principles for citing data and was a collaborative project involving representatives of publishers, data repositories and research institutions. Many publishers, for example Elsevier, have endorsed the Principles as an industry standard and incorporated them in production and publication workflows.
The Joint Declaration of Data Citation Principles states, “a data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community”. Best practice is to assign a Digital Object Identifier (DOI) persistent identifier to a dataset and to use the DOI in the data citation. Assigning DOIs to datasets and including them in the citation contributes to the long-term accessibility of the data and facilitates linking of articles and data. DOIs are also very useful in collecting data-specific metrics that can contribute to the reward incentive for data sharing authors. There are some international efforts underway to develop data-specific metrics and some publishers are developing products to support this, such as the Thomson Reuters Data Citation Index.
Data citation formats
Guidance for authors on how to format the data citation and where the citation will appear in the publication is important. It should also be clear that authors should cite the data they generated during their research and/or the existing data they used in their research. Useful resources for guiding researchers in data citation styles include:
- DataCite, a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data, which states: “We recognise that the challenges associated with data publication vary across disciplines, and we encourage research communities to develop citation systems that work well for them.”
- The UK Digital Curation Centre’s How to Cite Datasets and Link to Publications.
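To make the idea of a consistently formatted data citation concrete, the sketch below builds a reference string with the layout Creators (Year). Title. Publisher. DOI link. This layout, the function name and the example values are illustrative assumptions; a journal's own reference style should take precedence, and the DOI shown is a placeholder rather than a real dataset.

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Build a reference string: Creators (Year). Title. Publisher. DOI link."""
    names = "; ".join(creators)
    # Present the DOI as a resolvable link so readers can reach the dataset.
    if doi.startswith("https://doi.org/"):
        doi_link = doi
    else:
        doi_link = "https://doi.org/" + doi
    return f"{names} ({year}). {title}. {publisher}. {doi_link}"


citation = format_data_citation(
    creators=["Researcher, A.", "Colleague, B."],
    year=2017,
    title="Example survey dataset",
    publisher="Example Institutional Repository",
    doi="10.5555/example",  # placeholder, not a real DOI
)
print(citation)
```

Giving authors a template like this, alongside the journal's usual reference style, is one way to reduce the variation in how datasets are cited.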
Where to include the data citation in a publication
In the journal article, DataCite: Lessons Learned on Persistent Identifiers for Research Data Project, the authors report that researchers are not consistently citing datasets in journal articles and when they do there is wide variation in practice. For example, data is cited in the main body of the article, in the notes section, within a dedicated section of the journal article or in the references section. Therefore, it is useful if a journal policy covers not only how, but also where, the data should be cited.
As with article, book and web citations, the dataset will be cited at the relevant place in the text of the article, and the reference will appear in the reference list, formatted in the same way as other references. Where possible, the reference will provide a direct link to the stored dataset, making it even easier for both reviewers and readers to access relevant datasets.
Citing software and related materials
Including reference to software citation in a journal policy contributes to reproducible research and a rewards system that values all of the components of a research project.
Resources that may help inform a data policy include:
- Software Citation Principles (2016) provides a draft guide and discussion paper which suggests that software should be considered a legitimate citable product of research and that scholarly credit must therefore be awarded for it. The authors present the ingredients for software citation to support persistent access and accessibility.
- GitHub provides advice on how to make your code citable, which involves minting a DOI for the code repository. While there are few ways to measure the impact of researchers who code, this is changing with the release of impact tools such as Depsy, which seeks to “measure the value of software that powers science”.
ANDS provides a number of useful guides and resources on data citation that can be used to inform journal policy development, including:
- ANDS Data Citation Guide which provides an overview of why, when and how to cite data
- ANDS Digital Object Identifier (DOI) System for Research Data Guide which covers the DOI system, its advantages and use in data citation plus an introduction to the ANDS DOI minting service. The Guide is accompanied by an identifier decision tree.
A data policy that clearly states the consequences of non-compliance for authors will be more effective than one that does not. However, publishers will need to give careful consideration as to what the consequences of non-compliance may be and their capacity to enforce such consequences.
A data policy may also include:
- contact details for help with compliance issues, as some authors may require assistance in order to comply, particularly if their data is sensitive
- a method of monitoring compliance failures, such as a procedure for registering complaints about non-compliance
- a statement of the publisher’s right to post a correction to, or retraction of, the data following publication.
The PLOS policy on data availability states:
Refusal to share data and related metadata and methods in accordance with this policy will be grounds for rejection. PLOS journal editors encourage researchers to contact them if they encounter difficulties in obtaining data from articles published in PLOS journals. If restrictions on access to data come to light after publication, we reserve the right to post a correction, to contact the authors' institutions and funders, or in extreme cases to retract the publication.
In developing a data policy, editors are encouraged to:
- review the data policy advised by the journal publisher
- consult with the Editorial Board and engage Board members in the policy development process
- consult with discipline leaders to ensure the data policy is acceptable to the community
In terms of timing, editors should give authors sufficient notice of the new or revised data policy and communicate not only the policy but also the motivations for it. Be prepared to allow a wash-in period, for example six months, for authors to become familiar with the policy before it comes into effect. Data policies should also be prospective rather than retrospective.
How can ANDS help?
- The ANDS website and the Journal Editors guide offer advice on issues around data management.
- ANDS is a co-chair of the Research Data Alliance Interest Group on Data Policy Standardisation and Implementation focusing on journal data policies.
- ANDS is coordinating various events that bring together publishers, editors, data facility providers, domain experts and researchers.
- ANDS offers a point of contact for journal editors and publishers seeking advice on the creation or enhancement of a data availability policy.