ANDS has established a national registry of data collections and a related discovery portal called Research Data Australia (RDA). It is designed to allow researchers and research organisations to publish the existence of research data and to allow prospective users of that data to discover it and evaluate its possible applicability to new research.
By establishing national infrastructure to support standard definitions, reference values, and identifiers for these common entities and by supporting contributors' use of these standard terms or identifiers when describing research data collections, additional connections between data collections can be discovered through the RDA service. Better and more standardised definitions also allow programmatic interfaces to manipulate data, link data, and aggregate data on the super‐human scales required for contemporary research.
ANDS is collaborating with other national agencies to establish an underpinning informatics infrastructure that improves the potential coherence and integration of research grants, projects, data and publications. These projects are described below.
What is the data connections strategy?
The basic information model for the ANDS Registry is the international standard ISO 2146:2010 Information and documentation – Registry services for libraries and related organizations. This describes a federated registry service that contains descriptive and administrative metadata not just for collections, but also for related services, parties and activities and the relationships between them.
The data connections strategy builds on this approach by incorporating common referencing methods for researchers, research groups, research activities, places, research datasets, research fields, and scholarly or scientific terminology.
Strategies used include:
- providing web pages indexable by large search engines such as Google and Yahoo
- exposing descriptions of research data collections via a number of standard query, harvest and syndicate web services, and publicising these services so that they are consumed by other portals and mash-up services
- constructing a mesh of inter-linked information about data collections designed to provide "discovery in context" through supplementary information on the people, organisations, research activities, and services related to the data collections
- leveraging these strategies in the RDA portal, a specialised window on the Australian research and innovation sector that is expected to be particularly useful for solving cross-disciplinary problems such as minimising the impact of climate change
- linking to other portals that may be focused on particular disciplines or types of data to support more nuanced and discipline‐specific discovery.
Connecting data through people and organisations
For those searching for datasets relevant to their research, the researcher or research group is often a strong indicator of possible relevance and a measure of quality. However, multiple forms of a researcher's name can be used over time, and multiple researchers can share the same name. Even if the provider of dataset descriptions has used a locally unique identifier to reference a researcher, researchers often have multiple affiliations, and other institutions will refer to the same person in their dataset descriptions with a different identifier. A common public identifier is needed to bring together research datasets that have a researcher or research group in common.
This longstanding problem in the scholarly information ecosystem has been addressed by scholarly publishers implementing proprietary researcher identity systems, such as Elsevier's 'AuthorID' and Thomson Reuters' 'ResearcherID'. Although these systems have incomplete coverage for our domain and are not open for machine access, they are important identifiers, and the public identifier system chosen for Australian researchers will need to integrate with global researcher identity systems such as these, and with ORCID.
ANDS supports the use of ORCID identifiers for researchers.
The National Library of Australia (NLA) has extended its existing Trove—People and Organisations infrastructure to harvest party information (people and groups) from Australian universities and other research institutions, and to improve automatic and manual identity matching services. ANDS promotes the use of this infrastructure by institutions supplying dataset descriptions to its registry. The party information provided is considered to be part of the public profile for this researcher and will be displayed in both Trove and RDA. The NLA use auto-matching techniques to link ORCIDs to the corresponding NLA Party Identifier for the same researcher.
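Systems that accept ORCID identifiers in dataset descriptions can catch many typing errors before any matching is attempted, because the final character of an ORCID iD is a check character computed with the ISO 7064 MOD 11-2 algorithm. The following is a minimal sketch of such a check; the function names are illustrative and are not part of any ANDS or NLA interface.

```python
import re

# An ORCID iD is written as four groups of four characters, e.g. 0000-0002-1825-0097.
# The last character is a check character and may be the letter X.
_ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string has the ORCID iD layout and a matching checksum."""
    if not _ORCID_PATTERN.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]
```

A passing checksum only shows that the identifier is well formed. It does not show that the identifier belongs to the researcher named in the description, which is why identity matching is still needed.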
Connecting data through research grants and projects
The description of the research grant and/or project that generated a dataset is useful for promoting discovery and reusability. It may be a long-term project with various components that cross disciplines and institutions. It is important that the descriptions of datasets, whoever supplies them, reference the related research project using a common, unique identifier. The description of the research grant that funded the project can be used as a proxy for the project itself and this authoritative grant information is supplied by the major funding bodies.
Australian research output is predominantly funded by research grants from two Government funding agencies, the Australian Research Council and the National Health and Medical Research Council. Universities, research institutions and other funding bodies also fund projects directly, and ANDS has a program for obtaining grant information from more funders over time.
During 2015, ANDS developed an online discovery service for research grants, Explore Grants and Projects, as part of RDA. This infrastructure supports machine access using standard protocols for query, harvest and linked data. It also provides persistent, unique, citable identifiers for each research grant.
Additionally, institutions that have tagged publications in their institutional repositories with the above grant identifier will see their publications also linked to the grant in RDA, provided they supply their publication metadata to the NLA's Trove service.
ANDS will assist research institutions to develop local infrastructure to integrate this definitive source information into their information systems. In this way interfaces for describing datasets will be able to reference the research projects that generated them, and providers of dataset descriptions to the ANDS Registry will be able to use a common identifier for research programs and projects.
Connecting data through place names and locations
An important goal of the Australian Research Data Commons is to enable cross-disciplinary discovery of related research data, and spatial location is a vital linkage mechanism in this process. The value of the data commons will be increased if the dataset descriptions include spatial coverage data encoded as geographical points or polygons rather than just text. Non-GIS-experts from arts, humanities, and science need the ability to enrich their dataset descriptions with standardised spatial information.
Achieving this goal required the establishment of a robust national infrastructure that allows place names to be validated efficiently by both individuals and software systems against an Australian Gazetteer service. There will need to be distributed sources of gazetteer data, depending on jurisdiction, feature types, temporal coverage and language.
In Australia, the authorities for geographic feature place names and their geospatial location are state and local Governments. This data has been aggregated by Australia's National Mapping Agency, Geoscience Australia, which produces the National Topographic Maps series for Australia.
The movement to Creative Commons Licensing for publicly funded data has created an environment in which ANDS has been able to fund a project with the Office of Spatial Data Management to create an Australian Gazetteer service. The service provides open access to geospatial data via a search interface, data downloads and a standard web service interface called WFS-G (the Open Geospatial Consortium Gazetteer Profile of the Web Feature Service). This enables user interfaces in which spatial coverage is entered as place names and then converted to geospatial coordinates.
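As an illustration of how software might look up a place name through a Web Feature Service interface, the sketch below builds a WFS 1.1.0 GetFeature request with an OGC filter on the place name. The endpoint URL, feature type name and property name are placeholders invented for this example, and the WFS version is also an assumption; the real values would come from the gazetteer service's own documentation.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Hypothetical values for illustration only, not the real gazetteer service.
GAZETTEER_ENDPOINT = "https://gazetteer.example.org/wfs"
FEATURE_TYPE = "PlaceName"   # assumed feature type name
NAME_PROPERTY = "name"       # assumed property holding the place name


def build_place_name_query(place_name: str, max_features: int = 10) -> str:
    """Build a WFS 1.1.0 GetFeature URL that filters features by exact place name."""
    # The literal is XML-escaped so names containing & or < stay well formed.
    ogc_filter = (
        '<Filter xmlns="http://www.opengis.net/ogc">'
        "<PropertyIsEqualTo>"
        f"<PropertyName>{NAME_PROPERTY}</PropertyName>"
        f"<Literal>{escape(place_name)}</Literal>"
        "</PropertyIsEqualTo>"
        "</Filter>"
    )
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": FEATURE_TYPE,
        "maxFeatures": str(max_features),
        "filter": ogc_filter,
    }
    # urlencode percent-encodes the filter XML for use in a query string.
    return f"{GAZETTEER_ENDPOINT}?{urlencode(params)}"
```

A data entry form could call a function like this when the user types a place name, then offer the returned features so that the chosen coordinates are stored alongside the text.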
Connecting data through scientific and scholarly terminology
Controlled vocabularies are widely used to better organise and describe knowledge by standardising the use of language in metadata descriptions. The development of the SKOS (Simple Knowledge Organization System) standard, and the progressive improvement in underlying Semantic Web technologies, provides scope to improve the way that scientific knowledge is organised and linked.
Discussions are being held with both stakeholders and potential providers of such services to establish an inter-operable network of machine-accessible vocabulary services. Local systems will be able to dynamically access these vocabularies, so that user interfaces can present easy lookup features using the current versions of the vocabulary.
The promotion of the use of standard descriptors for research datasets will improve the discovery of datasets relevant to the work of the researcher.
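The "easy lookup" idea can be sketched with a small in-memory vocabulary. The concept fields below mirror the SKOS properties prefLabel, altLabel and broader, but the concepts, URIs and function name are made up for illustration; a real user interface would query a vocabulary service rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Concept:
    uri: str
    pref_label: str                    # corresponds to skos:prefLabel
    alt_labels: Tuple[str, ...] = ()   # corresponds to skos:altLabel
    broader: Optional[str] = None      # URI of the parent concept (skos:broader)


# Illustrative concepts only; the URIs are placeholders.
VOCABULARY = [
    Concept("http://example.org/vocab/earth-sciences", "Earth sciences"),
    Concept(
        "http://example.org/vocab/oceanography",
        "Oceanography",
        ("Ocean science", "Marine science"),
        broader="http://example.org/vocab/earth-sciences",
    ),
    Concept(
        "http://example.org/vocab/hydrology",
        "Hydrology",
        broader="http://example.org/vocab/earth-sciences",
    ),
]


def suggest(prefix: str, vocabulary: Sequence[Concept]) -> List[Concept]:
    """Return concepts whose preferred or alternative label starts with the prefix."""
    needle = prefix.strip().casefold()
    if not needle:
        return []
    matches = [
        concept
        for concept in vocabulary
        if any(
            label.casefold().startswith(needle)
            for label in (concept.pref_label, *concept.alt_labels)
        )
    ]
    return sorted(matches, key=lambda concept: concept.pref_label.casefold())
```

Because the match also covers alternative labels, a user who types "marine" is offered the preferred term "Oceanography", which is what keeps descriptions from different contributors aligned.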
Connecting data through fields of research
The Australian Bureau of Statistics and Statistics New Zealand have developed the Australian and New Zealand Standard Research Classification (ANZSRC) to describe fields of research and other terms and categories related to research funding and outputs. These classifications are widely used within the research sector in Australia for research reporting and assessment.
ANDS is partnering with the Australian Bureau of Statistics to make this classification system available via a vocabulary service as described in the previous section. This will enable local systems used by data providers to the ANDS registry to include these classifications in user interfaces for describing research outputs. This will improve discovery and precision filtering of retrieved data collection descriptions through discovery services such as RDA.
Connecting data through data citation
Although the Australian Research Data Commons is focused on aggregating descriptions of data collections rather than publications or other scholarly outputs, the user community needs discovery services that combine both. By making datasets citable through a common standard such as the Digital Object Identifier (DOI) system, relationships between publications and datasets can be exploited in discovery systems.
To promote the citation and reuse of Australian research data, ANDS provides a DOI name service for research datasets, free of charge to Australian research institutions. DOIs are minted through the ANDS Cite My Data service on behalf of the DataCite consortium.
To obtain a DOI, a minimum level of metadata is required; this is stored in the ANDS Registry as part of the process of minting a DOI name for a dataset. RDA features DOIs for datasets in its interfaces and encourages their use for citation purposes.
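A sketch of how a local system might check for minimum metadata and then build a citation string is shown below. The field names are assumptions loosely modelled on common DataCite properties (identifier, creators, title, publisher, publication year); the exact minimum set required by the minting service is not stated here. The example record and its DOI are placeholders.

```python
from typing import Dict, List

# Assumed minimum fields for this sketch; check the minting service's own rules.
REQUIRED_FIELDS = ("identifier", "creators", "title", "publisher", "publication_year")


def missing_fields(metadata: Dict) -> List[str]:
    """List required fields that are absent or empty in the metadata record."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


def format_citation(metadata: Dict) -> str:
    """Build a 'Creators (Year): Title. Publisher. DOI link' style citation."""
    missing = missing_fields(metadata)
    if missing:
        raise ValueError(f"cannot cite dataset, missing: {', '.join(missing)}")
    creators = "; ".join(metadata["creators"])
    return (
        f"{creators} ({metadata['publication_year']}): {metadata['title']}. "
        f"{metadata['publisher']}. https://doi.org/{metadata['identifier']}"
    )


# Placeholder record for illustration; this is not a real dataset or DOI.
EXAMPLE_RECORD = {
    "identifier": "10.5072/example-dataset",
    "creators": ["Nguyen, A.", "Smith, B."],
    "title": "Example reef survey",
    "publisher": "Example University",
    "publication_year": 2015,
}
```

Calling `format_citation(EXAMPLE_RECORD)` yields a single line that can be shown next to the dataset description, while `missing_fields` can drive form validation before a minting request is made.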
Connecting data to data storage
If a data collection is open, its metadata record should make clear where the collection can be downloaded or accessed. If a collection has a directly downloadable URI, the URI can be included in, for example, the RIF-CS location element or the metadata record on a landing page. If a collection is not available online, the metadata record should provide contact details for sourcing the data.
Connecting data to data
Very often, a collection is the output of a transformation that takes other data collections as input. In this case, describing both the input and output collections and linking them will increase the discoverability of both and the accountability of the output collection.
RIF-CS provides a vocabulary that describes various relationships between two data collections.
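Once such relationships are recorded, a discovery service can follow them to show where a collection came from. The sketch below walks "derived from" links to find every upstream collection. The collection keys are invented, and the relation name `isDerivedFrom` is used as an illustrative label; consult the RIF-CS vocabulary for the exact relation types available.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Relation = Tuple[str, str, str]  # (subject key, relation type, object key)

# Illustrative relationships between made-up collection keys.
RELATIONS = [
    ("collection/cleaned", "isDerivedFrom", "collection/raw"),
    ("collection/model-output", "isDerivedFrom", "collection/cleaned"),
    ("collection/model-output", "isDerivedFrom", "collection/rainfall"),
]


def upstream_collections(
    key: str, relations: Iterable[Relation], relation: str = "isDerivedFrom"
) -> Set[str]:
    """Return every collection the given one is directly or indirectly derived from."""
    parents = defaultdict(set)
    for subject, rel, obj in relations:
        if rel == relation:
            parents[subject].add(obj)

    seen: Set[str] = set()
    stack = [key]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:   # the seen set also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen
```

For the sample data, the model output traces back to the cleaned, raw and rainfall collections, so a reader evaluating the output can inspect each input in turn.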
Connecting data to services
If a data collection is an output of a model or a transformation, or can be visualised through a software tool, linking the collection to the software service will make the collection more accountable or reusable.
RIF-CS provides a vocabulary that describes various relationships between a data collection and a service.
Exploiting the connections
The goals of the ANDS data connections strategy are:
- To link data through shared entities and concepts, wherever researchers and the interested public are looking for it.
- To exploit these linkages in the RDA discovery portal to create a rich mesh of inter-linked information about research data collections.
These new data connection capabilities will support better discovery of research data collections.
- A discovery interface needs to 'prompt serendipity'. This can be done in three ways, all of which use the shared concepts and entities described above:
  - clustering search results (directed searching or faceted searching)
  - providing different ways to browse through the data collections via shared entities and concepts (additional entry points)
  - providing links to other datasets which are related to a common entity or share a concept.
- For cross-disciplinary discovery, searchers need extra clues to lead them to possibly relevant datasets. Spatial location is an especially useful linking mechanism for cross-discipline searching. Map-based searching can discover data across a wider variety of disciplines if geospatial locations have been included in descriptions (with the help of the Australian Gazetteer).
- Discipline-specific portals provide optimal discovery for research data that fits within the coverage of that portal. This is because the metadata will be richer and the granularity finer than in national portals like RDA which have no particular discipline focus. The efficacy of these portals will also be enhanced by the data connections infrastructure.
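Two of the techniques listed above, faceted clustering and links to related datasets, can be sketched over a handful of records that share researchers, fields and places. The record layout, keys and values below are invented for illustration and do not reflect the RDA data model.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Made-up dataset descriptions; each facet holds a list of shared-entity values.
RECORDS = [
    {"key": "ds-1", "fields": ["Oceanography"], "parties": ["researcher-a"],
     "places": ["Coral Sea"]},
    {"key": "ds-2", "fields": ["Ecology"], "parties": ["researcher-a", "researcher-b"],
     "places": ["Coral Sea"]},
    {"key": "ds-3", "fields": ["Oceanography"], "parties": ["researcher-c"],
     "places": ["Tasman Sea"]},
]

FACETS = ("fields", "parties", "places")


def facet_counts(records: Sequence[Dict], facet: str) -> Counter:
    """Count how many records carry each value of a facet (for clustering results)."""
    return Counter(value for record in records for value in record.get(facet, []))


def related_records(
    key: str, records: Sequence[Dict], facets: Sequence[str] = FACETS
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each other record to the (facet, value) pairs it shares with `key`."""
    by_key = {record["key"]: record for record in records}
    source = by_key[key]
    related = {}
    for other in records:
        if other["key"] == key:
            continue
        shared = [
            (facet, value)
            for facet in facets
            for value in source.get(facet, [])
            if value in other.get(facet, [])
        ]
        if shared:
            related[other["key"]] = shared
    return related
```

In this sample, ds-1 is linked to ds-2 through a shared researcher and place even though the two are classified under different fields of research, which is the kind of cross-disciplinary connection the strategy aims to surface.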