ANDS supports the development of institution-wide solutions for the discovery and reuse of research data collections. Having funded the development of a number of software solutions for creating rich metadata records about collections of data, the next step is to ensure that this metadata is properly managed so that it can be harvested and exposed to search engines as well as to researchers and research administrators. Metadata Stores are a key component of this infrastructure.
ANDS has supported metadata stores for data collections, with:
- connectors to institutional sources of truth,
- coverage over the entire institution, and
- feeds provided to Research Data Australia.
This guide is intended to provide an overview of the solutions that are being deployed at an institutional level. An appendix outlines considerations for deploying solutions with narrower scope. The guide is updated periodically as solutions mature.
Types of metadata stores
ANDS distinguishes between metadata stores by their coverage, the granularity of data that they describe, and the specialisation of their descriptions.
Based on coverage, types of metadata stores include:
- A local metadata store has coverage over data produced by a single instrument or research group.
- An institutional metadata store has coverage over data produced across the institution, typically by a variety of research groups and disciplines.
- A national metadata store has coverage over data produced across a country, by a variety of institutions. (Research Data Australia is an instance of a national store.)
- A discipline-specific metadata store has coverage over data produced within a discipline, across a variety of research groups, institutions, and (typically) countries.
Metadata about research collections is best created and managed close to where the research data is created, in local metadata stores, tightly integrated with research groups and their activities. This metadata should be relevant to researcher needs, and easily accessible.
However, the metadata stores with broader coverage are essential if data collections are to be discovered, tracked and used outside the immediate context of the research - across a discipline or an institution. Stores with broader scope are likely to have more users than local stores, and institutional and national stores use more generic formats, applicable to more domains. Stores with broader scope typically act as metadata aggregators, gathering metadata (or appropriate distillations of metadata) from local systems.
Based on granularity, types of metadata stores include:
- A collection-level metadata store describes data collections (collections, datasets, etc).
- An object-level metadata store describes individual data objects (files, database rows, spreadsheets, physical objects).
- An integrated metadata store describes both individual data objects and the collections that they comprise, in the one system, and is typically coupled with data storage for the data being described.
Based on specialisation, types of metadata stores include:
- A specialist metadata store captures metadata of interest to a discipline specialist.
- A generic metadata store only captures metadata which is of interest to a general audience; for example, university administration, university research office, general public, researchers in other fields.
The specialisation of a metadata store depends on who will be using it. Both kinds are necessary: specialist metadata may be generated first (especially if automated), but it is usually difficult to repurpose it automatically into generic metadata.
Institutional solutions tend to be generic, since their metadata descriptions cannot be discipline-specific. However, an institutional solution can be configured to provide different solutions for different disciplines.
Object-level stores are typically specialist, because discipline knowledge is needed to make sense of individual data objects. Data capture often produces specialist metadata automatically. If a specialist store is managing data objects and the discipline needs to organise those objects into a collection, it will usually do so as an integrated store, so that the management of objects and collections is co-located.
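The three classification axes described above (coverage, granularity, specialisation) can be sketched as a small data model. This is an illustrative sketch only: the class names, the `is_aggregator_candidate` helper and the two example stores are hypothetical and are not part of any ANDS software.

```python
from dataclasses import dataclass
from enum import Enum


class Coverage(Enum):
    LOCAL = "local"
    INSTITUTIONAL = "institutional"
    NATIONAL = "national"
    DISCIPLINE = "discipline-specific"


class Granularity(Enum):
    COLLECTION = "collection-level"
    OBJECT = "object-level"
    INTEGRATED = "integrated"


class Specialisation(Enum):
    SPECIALIST = "specialist"
    GENERIC = "generic"


@dataclass(frozen=True)
class StoreProfile:
    """Describes one metadata store along the three axes used in this guide."""
    name: str
    coverage: Coverage
    granularity: Granularity
    specialisation: Specialisation

    def is_aggregator_candidate(self) -> bool:
        # The guide notes that stores with broader scope typically act as
        # aggregators; here "broader" is taken to mean anything but local.
        return self.coverage is not Coverage.LOCAL


# Hypothetical examples, for illustration only.
beamline_store = StoreProfile(
    "beamline store", Coverage.LOCAL, Granularity.INTEGRATED, Specialisation.SPECIALIST
)
university_registry = StoreProfile(
    "university registry", Coverage.INSTITUTIONAL, Granularity.COLLECTION, Specialisation.GENERIC
)
```

A profile like this can help an institution state explicitly which role each of its stores is expected to play before choosing software.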
Institutions are different and have different needs and approaches. There is no single solution that fits all. Nevertheless, ANDS encourages its partners to consider deploying one of the developed solutions rather than duplicating development effort internally.
Descriptions of data collections should not be seen as information islands. They need to be connected to other kinds of information, which may be stored and managed in different data stores. For example, ANDS requires information about related parties and activities to accompany collections. The authoritative sources of truth for information about people can be HR and Research Office systems. A metadata store should be reusing that metadata, rather than creating its own records, with potentially inaccurate information. A characteristic of high-quality metadata is that it is created once and then re-used as needed.
If the contextual information is common across different institutions, it is appropriate to have a common external authority for the information. A common description of a grant or researcher across institutions allows users to navigate between data collections held by different institutions, but involving the same research team members.
This means that deploying a metadata store solution usually involves integrating multiple sources of truth, possibly including external sources of truth. If such data has already been aggregated or centralised in the institution (e.g. as a data warehouse), it can be exploited by institutional metadata stores.
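The "create once, reuse as needed" principle can be illustrated with a collection record that holds only party identifiers and resolves the details from the source of truth at export time. This is a minimal sketch under stated assumptions: `HR_SYSTEM`, `CollectionRecord` and `resolve_parties` are hypothetical names, and a real deployment would query an HR or Research Office system rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stand-in for an authoritative HR system, keyed by staff identifier.
HR_SYSTEM: Dict[str, Dict[str, str]] = {
    "staff:1001": {"name": "A. Researcher", "department": "Chemistry"},
}


@dataclass
class CollectionRecord:
    key: str
    title: str
    party_ids: List[str]  # references only; no copied names or departments


def resolve_parties(record: CollectionRecord,
                    directory: Dict[str, Dict[str, str]]) -> List[Dict[str, str]]:
    """Look up each referenced party in the source of truth at export time."""
    resolved = []
    for party_id in record.party_ids:
        party = directory.get(party_id)
        if party is None:
            raise KeyError(f"unknown party identifier: {party_id}")
        resolved.append({"id": party_id, **party})
    return resolved


record = CollectionRecord("coll:42", "Example data collection", ["staff:1001"])
```

Because the record stores only the identifier, a correction made in the source system appears in the next export without anyone editing the collection description.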
Mature Institutional Solutions
VIVO/VITRO (ANDS-funded project: EIF029, EIF002)
The project is using VIVO, a semantic web, triplestore-based approach to gathering and sharing research data. VIVO has been developed by a consortium in the US (originally Cornell University). The project is based on the VIVO code base and the VIVO ontology for describing research. This ontology has been enhanced to support ANDS requirements, and the enhancements (called the ANDS VITRO ontology) are being built as a community initiative involving several Australian universities. The ANDS VITRO ontology is extensible and more detailed than RIF-CS, and can be applied to a wide variety of purposes. The ANDS VITRO ontology is available here.
The VIVO approach provides an integrated University-wide view of research. VIVO came into being because of the need to present views of research identity that cross organisational boundaries, in the absence of established whole-of-University reporting practices in the US. VIVO is well-suited to institutions in which the research office takes the lead in implementing aggregation in collaboration with the library. VIVO can provide such institutions with detailed modelling of their research collections and researchers, for example in publishing researcher profiles across the institution.
As a semantic web-oriented product, VIVO is based on triplestore technology, which enables powerful SPARQL queries of metadata and benefits from inferencing capabilities. The VIVO approach also offers institutions the ability to create a whole of University Research Data Registry.
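To give a feel for what querying a triplestore involves, the sketch below implements triple pattern matching over an in-memory set of triples and joins two patterns by hand. It is not VIVO code and does not use a real SPARQL engine; the `ex:` identifiers and predicates are hypothetical and are not taken from the VIVO or ANDS VITRO ontologies.

```python
class TinyTripleStore:
    """A toy store of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


def collections_for_group(store, group):
    # Roughly the shape of this SPARQL query (hypothetical vocabulary):
    #   SELECT ?c WHERE { ?c ex:hasCreator ?p . ?p ex:memberOf <group> }
    members = {s for s, _, _ in store.match(predicate="ex:memberOf", obj=group)}
    return sorted(s for s, _, o in store.match(predicate="ex:hasCreator") if o in members)


store = TinyTripleStore()
store.add("ex:collection1", "ex:hasCreator", "ex:person1")
store.add("ex:collection2", "ex:hasCreator", "ex:person2")
store.add("ex:person1", "ex:memberOf", "ex:groupA")
store.add("ex:person2", "ex:memberOf", "ex:groupB")
```

A real triplestore performs this kind of join itself, and can add inferred triples before the query runs; neither feature is modelled here.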
The VIVO metadata store solution enables Linked Data approaches to research data, being RDF-based, but it is currently oriented to collection-level descriptions of data, and not more fine-grained descriptions.
The VIVO platform can handle both automated and manual feeds of research data into the triple store, from single or multiple data sources. As with all University-wide metadata aggregator solutions, building those feeds is still the responsibility of the deployer, and may involve significant effort in cleaning up the data, and in modelling the connections to outside data. The effort is substantially less if the institution already has a data warehouse in place. In any case, data must be mapped from existing data stores to the ANDS-VITRO ontology to be ingested by the system.
Code is already in place for converting the RDF of VIVO to RIF-CS and for providing an OAI-PMH harvest point from VIVO. EIF 002 produced Kepler workflows to automate populating VIVO for their metadata hub, as well as providing a harvester; code and documentation are available. An authoritative overview of this project is available along with a Vimeo video presentation.
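For readers unfamiliar with OAI-PMH, a harvest is driven by plain HTTP requests with a `verb` parameter. The sketch below builds a `ListRecords` request URL using only the standard library. The base URL is a placeholder, and the `rif` metadata prefix is an assumption about how a given provider names its RIF-CS format; a provider's `ListMetadataFormats` response states the actual prefixes it offers.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None,
                     from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # with the verb alone to fetch the next page of an earlier request.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date  # e.g. "2011-01-01" for incremental harvests
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not a real harvest point.
first_page = list_records_url("https://vivo.example.edu/oai", "rif")
```

A harvester would fetch `first_page`, read any resumption token from the response, and repeat with that token until none is returned.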
VIVO is in use by the University of Melbourne (EIF029), Griffith University (EIF 002), the Queensland University of Technology (EIF 002), University of Tasmania, and the University of Western Australia. Of these deployments, most are using VIVO as an interface and export tool: the deployments are ingesting mapped research activity data that is stored and managed in more traditional formats, in institutional data silos (Oracle, Mediaflux, Research Master).
ReDBox: Research Data Box (ANDS-funded project: EIF040)
This solution uses its own instance of the Fedora-commons data store to store and disseminate metadata on research collections. The store uses as its front-end the Fascinator faceted search software developed at the University of Southern Queensland.
The ReDBox solution takes an Institutional Repository approach to research metadata: metadata is collected through user forms as an interface to the repository, as well as automated integration of the repository with other campus systems. Metadata already collected in the repository is repurposed for disseminating research data. The solution is well-suited to institutions which already have a strong repository presence, with established work practices for repository management (so that the repository is enhanced by deploying the solution), and in which the library takes the lead in implementing aggregation.
Metadata can be added to the system either manually or via automatic harvesting from other systems. The manual entry of research metadata is supplemented by alerts issued by interfacing systems, including grants databases and disk storage; these alerts point repository maintainers to new instances of research data to be processed. The metadata is internally stored using the VITRO ontology, in which the ReDBox project are also stakeholders. This means that the solution offers semantic consistency with institutions using VIVO/VITRO; but deploying and using the software does not require developing semantic web skills.
The ReDBox solution also includes the "Mint" infrastructure supporting controlled vocabularies used in research metadata, and treating them as Linked Data. The Mint allows validation of data entered by users in the forms interface, and it is also how ReDBox deals with party and activity identifiers. The use of unique identifiers ensures data integrity for the records they identify.
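The idea of validating form input against controlled vocabularies can be sketched as follows. This is not the Mint's API: the `VOCABULARIES` table, the identifiers in it and the `validate_form` function are hypothetical, and a real deployment would look values up in a vocabulary service rather than a dictionary.

```python
# Hypothetical controlled vocabularies: field name -> {identifier: label}.
VOCABULARIES = {
    "subject": {"SUBJ-001": "Example subject A", "SUBJ-002": "Example subject B"},
    "party": {"PARTY-17": "Example researcher"},
}


def validate_form(form, vocabularies):
    """Return a list of error messages; an empty list means the form is valid."""
    errors = []
    for field, value in form.items():
        vocabulary = vocabularies.get(field)
        if vocabulary is None:
            continue  # free-text field, nothing to check against
        if value not in vocabulary:
            errors.append(f"{field}: '{value}' is not a known identifier")
    return errors
```

Checking identifiers at entry time means that a record can only refer to parties and subjects that actually exist in the vocabulary, which is the data-integrity benefit described above.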
Because the solution is repository-driven, it can support description of individual resources as repository objects. The repository can also store descriptions of data collections which themselves are stored remotely (e.g. in a large disk array) or can be used to house data collections themselves. Being Fedora-based, the solution has built-in support for OAI-PMH, and for multiple metadata schemas describing the same object.
ReDBox has been taken up at the University of Newcastle, and will be taken up at Flinders University. See a list of ReDBox users at the ReDBox site. Project development on ReDBox 1.0 is now complete. The Queensland Cyber Infrastructure Foundation was funded until June 2012 to help deploy ReDBox, and to develop fixes and enhancements to the solution (ReDBox 1.1).
QCIF has worked with ANDS to develop and productise the open source ReDBox Software originally developed with ANDS funding at the University of Southern Queensland (USQ). While ReDBox Software is a freely available open source product, research organisations deploying ReDBox may require access to software Support Services. See QCIF Service Agreement and Schedule of fees (v1.0).
MyTARDIS (Squirrel: ANDS-funded project: EIF019; MeCAT: ANDS-funded project: EIF020, EIF037)
The Squirrel and MeCAT projects are both extending the MyTARDIS codebase for use as an institutional metadata store. MyTARDIS was initially developed for storing datasets and metadata in protein crystallography. The code is now being made more versatile to easily fit in with other discipline-specific and generic approaches to research data management and reuse. The system under development allows researchers to organise, describe, find, reuse and share their data, which is stored in a central data store.
The two projects are coordinating their work, to ensure that the codebase remains common between the two. We refer to the projects together as TARDIS systems in the following.
The Squirrel project involves Monash University. Squirrel includes a schema registry allowing users to define their own metadata schemas to describe their data. Squirrel aims to support the self-deposit, by researchers and research support intermediaries, of discipline-specific metadata and the descriptive and administrative metadata which needs to be provided to research offices, libraries, records and archives, and Research Data Australia. It provides integration with externally stored information about parties (e.g. researchers) and activities (e.g. grant-funded projects) through web services. Web services for Monash are being provided by another ANDS-funded project (EIF 038), and would require local development for any new deployment.
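A schema registry of the kind described can be pictured as a table of user-defined schemas against which incoming metadata is checked. The sketch below is not Squirrel or MyTARDIS code; the `SchemaRegistry` class, the schema name and the field names are all hypothetical.

```python
class SchemaRegistry:
    """Holds user-defined metadata schemas and checks metadata against them."""

    def __init__(self):
        self._schemas = {}

    def register(self, name, required, optional=()):
        self._schemas[name] = (frozenset(required), frozenset(optional))

    def validate(self, name, metadata):
        """Return a sorted list of problems; an empty list means the metadata conforms."""
        required, optional = self._schemas[name]
        supplied = set(metadata)
        missing = sorted(required - supplied)
        unknown = sorted(supplied - required - optional)
        return [f"missing: {f}" for f in missing] + [f"unknown: {f}" for f in unknown]


registry = SchemaRegistry()
# A hypothetical discipline-specific schema defined by a research group.
registry.register("crystallography", required=["wavelength", "detector"],
                  optional=["temperature"])
```

Keeping schemas as data rather than code is what lets researchers add a new discipline-specific schema without a software release.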
The MeCAT project is extending MyTARDIS for deployment on the Infrared and Small Angle X-Ray Spectroscopy beamlines at the Australian Synchrotron (EIF 020), and five beamlines at the Bragg Institute, which is part of the Australian Nuclear Science and Technology Organisation (ANSTO) (EIF 037). The enhancements being made to MyTARDIS will enable users to search and download data and metadata from the facilities and assist beamline scientists in managing data from their beamlines, supporting users and improving beamline operations. These enhancements include:
- storing more detailed information on the equipment being used to conduct experiments and the samples being analysed at the facilities
- extending the authentication and authorisation capabilities of MyTARDIS to provide more fine grain control over who can access data
- extending the search capabilities to work with scientific data in multiple disciplines
- detailed logging and audit trails for tracking access and modification of metadata.
TARDIS systems will include an independent OAI-PMH provider, meaning several complementary dissemination models can be supported, including:
- direct harvest of RIF-CS for registering collections with Research Data Australia and/or
- provision of discipline-specific metadata to a discipline-specific portal (e.g. TARDIS in the case of crystallography) and/or
- transfer of metadata to other repositories or aggregators that need information about research outputs, to meet institutional goals around research assessment/impact, and compliance with legislation and protocols for record-keeping and responsible research.
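The dissemination models listed above rely on one record being offered in more than one metadata format, selected by the harvester through the OAI-PMH `metadataPrefix` argument. The following sketch shows that dispatch step only. It is not TARDIS code: the prefixes `rif` and `discipline`, the field names and both view functions are assumptions, and dictionaries stand in for the XML a real provider would return. `cannotDisseminateFormat` is the OAI-PMH error code for an unsupported prefix.

```python
class CannotDisseminateFormat(Exception):
    """Mirrors the OAI-PMH error code 'cannotDisseminateFormat'."""


def registry_view(record):
    # Generic, collection-level summary suitable for a national registry.
    return {"title": record["title"], "description": record["description"]}


def discipline_view(record):
    # Full specialist metadata for a discipline portal.
    return dict(record)


# Hypothetical mapping from metadataPrefix to serialiser.
FORMATS = {"rif": registry_view, "discipline": discipline_view}


def get_record(record, metadata_prefix):
    serialise = FORMATS.get(metadata_prefix)
    if serialise is None:
        raise CannotDisseminateFormat(metadata_prefix)
    return serialise(record)


sample = {
    "title": "Example diffraction dataset",
    "description": "Images collected in one experiment.",
    "wavelength_angstrom": "0.95",
}
```

Supporting a further consumer then amounts to registering one more serialiser, without changing how records are stored.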
TARDIS systems are aimed at facilitating research data management and reuse, and are not intended to function as a metadata aggregator.
TARDIS systems are currently being taken up at: Monash University, the Australian Synchrotron, ANSTO, the Ian Wark Institute, and the Royal Melbourne Institute of Technology. Squirrel project development commenced in February 2010, and is scheduled to run until August 2011. MeCAT project development commenced in March 2010, and ran until December 2011.
ANDS ORCA Registry
ANDS has internally developed ORCA as a metadata store for managing the RIF-CS records that are collected in Research Data Australia. ORCA is set up to provide OAI-PMH feeds of the RIF-CS records it stores, and also has authoring support for RIF-CS. That means ORCA adequately supports the narrow goal of authoring and disseminating RIF-CS records.
However, ORCA does not provide the broader support of research data management or integration with external data that ANDS sees as desirable in metadata stores. For that reason, ANDS does not encourage using ORCA as a substitute for deploying a fully-fledged metadata store.
Currently ANU is planning to augment ORCA for use as an institutional metadata store.
Geonetwork (ANDS-funded project: EIF023)
Geonetwork is an open source catalogue application to manage spatially referenced resources. It provides powerful metadata editing and search functions, as well as an embedded interactive web map viewer. It is currently used in numerous Spatial Data Infrastructure initiatives across the world. "GeoNetwork has been developed to connect spatial information communities and their data using a modern architecture, which is at the same time powerful and low cost, based on the principles of Free and Open Source Software (FOSS) and International and Open Standards for services and protocols (a.o. from ISO/TC211 and OGC)".
Geonetwork is targeted as a researcher-oriented metadata solution. Deployments of Geonetwork should couple it with a repository for storing the data described by the metadata; the standard implementation includes data storage, but is not very robust. Oracle is commonly used, but PostgreSQL and MySQL have both been used successfully.
Geonetwork has been implemented by a number of Australian public sector agencies, with enhancements. In particular, the BlueNet MEST is an enhanced version of GeoNetwork 2.2. Amongst other things, these enhancements provide support for the profiles of the AS/NZS-19115 geographic metadata standard - Marine Community Profile (MCP).
The AODN (Australian Ocean Data Network) ANDS project (EIF023) uses the AODN MEST, which is based in turn on the BlueNet MEST. Geonetwork can be used as a harvester to pull together records from other organisations using the same tool: the AODN MEST (Metadata entry and search tool) currently harvests 9 other Geonetwork MESTs.
The application of the BlueNet MEST as the Australian Government Metadata Entry Tool, and the ANZLIC Metadata Entry Tool, is the result of collaboration between the former Australian Office for Spatial Data Management, ANZLIC: the Spatial Information Council, GeoScience Australia, and the BlueNet project.
A crosswalk from the ISO19115 MCP to RIF-CS has been implemented. The AODN have implemented an OAI-PMH harvest point from the AODN MEST into Research Data Australia.
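In general terms, a crosswalk maps fields in one schema to their counterparts in another. The sketch below is not the AODN crosswalk: it takes a flattened dictionary with three ISO 19115-style field names as a simplification of a real ISO record, and emits a small RIF-CS-like element tree. The element and attribute names are written from memory of the RIF-CS schema and should be checked against the published schema before use; the sample values are placeholders.

```python
import xml.etree.ElementTree as ET

# RIF-CS namespace (an assumption to verify against the published schema).
RIF_NS = "http://ands.org.au/standards/rif-cs/registryObjects"


def _q(tag):
    """Qualify a tag name with the RIF-CS namespace."""
    return f"{{{RIF_NS}}}{tag}"


def crosswalk(iso, group, source):
    """Map a flattened ISO-style record to a minimal RIF-CS-like collection."""
    root = ET.Element(_q("registryObjects"))
    obj = ET.SubElement(root, _q("registryObject"), {"group": group})
    ET.SubElement(obj, _q("key")).text = iso["fileIdentifier"]
    ET.SubElement(obj, _q("originatingSource")).text = source
    collection = ET.SubElement(obj, _q("collection"), {"type": "dataset"})
    name = ET.SubElement(collection, _q("name"), {"type": "primary"})
    ET.SubElement(name, _q("namePart")).text = iso["title"]
    ET.SubElement(collection, _q("description"), {"type": "brief"}).text = iso["abstract"]
    return root


# Placeholder input, for illustration only.
SAMPLE_ISO = {
    "fileIdentifier": "abc-123",
    "title": "Example sea surface temperature series",
    "abstract": "Placeholder abstract.",
}
rif = crosswalk(SAMPLE_ISO, group="Example Organisation", source="https://mest.example.org")
```

A production crosswalk also has to handle repeated elements, spatial and temporal coverage, and related parties, none of which are attempted here.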
Appendix - Local Metadata Stores
The Metadata Stores program is not directly funding local metadata stores, specific to a research group. It does fund integrated metadata stores (Squirrel, MeCAT) which can be configured for discipline-specific use; but these are still institutional stores: they can be tightly coupled with instruments and collection production, but they can also be coupled more loosely.
However, local metadata stores are crucial to good data management and to populating broad-scope metadata stores. The Data Capture projects funded by ANDS often involve setting up a local metadata store, specific to the instrument, for that reason. ANDS cannot recommend metadata stores for specific disciplines or projects; however, researchers should consider the following requirements for their local stores.
The local metadata store should:
- Store metadata that supports discovery and evaluation of data (e.g. keywords).
- Store metadata in a format which is in common use in the discipline.
- Store metadata that supports reuse of data (e.g. experimental configuration, interpretation of dependent variables, access rights; these may simply be a link to a separate file or a paper).
- Export metadata to other formats commonly used in describing metadata, especially in metadata aggregators (note that OAI-PMH requires a feed to be available in Dublin Core).
- Support aggregation of metadata (harvesting and/or syndication), especially for international discipline repositories.
- Support automated gathering of metadata from instruments (e.g. file header), and of related metadata from other databases (e.g. HR systems, grants programs).
- Integrate in researcher workflows with minimal disruption (e.g. through web services).
- Allow error checking, validation, and use of constrained vocabularies.
- Allow metadata describing both collections and objects within collections, if that is appropriate to the discipline.
- Allow hierarchical organisation of metadata, where appropriate to the discipline (e.g. ordering metadata by project and/or experiment).
Not all metadata store solutions will satisfy all requirements; automated metadata gathering and integration, in particular, are not widespread, and should not automatically disqualify a candidate store. All these features are worth considering in evaluating candidates, and researchers need to work out which features are priorities for them. The highest priorities are likely to be commonly used formats, hierarchical organisation, and aggregation support.
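As an illustration of the export requirement in the list above, the sketch below turns a simple metadata dictionary into an unqualified Dublin Core record of the kind carried in an OAI-PMH `oai_dc` feed. The two namespace URIs are the standard `oai_dc` and Dublin Core element namespaces; the input dictionary layout and the choice of five fields are assumptions made for brevity.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# A small subset of the fifteen Dublin Core elements, chosen for this sketch.
EXPORTED_FIELDS = ("title", "creator", "subject", "description", "identifier")


def to_oai_dc(metadata):
    """Build an oai_dc:dc element from a dict of strings or lists of strings."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for field in EXPORTED_FIELDS:
        values = metadata.get(field, [])
        if isinstance(values, str):
            values = [values]  # allow a single value without a list
        for value in values:
            ET.SubElement(root, f"{{{DC}}}{field}").text = value
    return root


# Placeholder local record; 'wavelength' is specialist metadata with no
# Dublin Core counterpart in this sketch, so it is not exported.
local_record = {
    "title": "Example instrument run",
    "subject": ["keyword one", "keyword two"],
    "wavelength": "0.95",
}
dc_record = to_oai_dc(local_record)
```

The specialist fields that do not survive this export are the reason a local store should keep its own discipline format as well as a generic feed.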