Persistent identifiers: Expert level

1. Introduction


This module aims to provide research administrators and technical staff with a thorough understanding of the issues involved in setting up a persistent identifier infrastructure.

  • It provides an overview of the types of possible identifier services, including core services and value-added services
  • It describes the services available as part of the ANDS Identify My Data Product
  • It offers a comprehensive review of the policy issues that are involved in setting up persistent identifiers
  • Finally, a glossary captures the underlying concepts on which the policies and services are based.

1.1 Identifier services

Software interacts with identifiers through identifier services, which are provided by an identifier management system. Identifier services may offer:

  1. Curation services: These services create, update and delete identifiers. Note that deleting persistent identifiers should be avoided.
  2. Lookup services: These services use identifiers to access the identified items, or provide information about the items.

Additionally, it is possible that in the future a persistent identifier service could provide information as to whether an object has changed since its identifier was created.

1.1.1 Identifier Standards

Several persistent identifier standards provide the basis for online identifier services.

There are also many informal standards for content-based persistent identification.

ANDS Online Services which mint DOI and Handle Persistent Identifiers are based on The Handle System.

1.1.2 Creating identifiers

Creating an identifier involves associating a label with a context to produce a name, and then associating that name with a thing to produce the identifier.

A name is an association of a label (a symbol) with a context that the label is in. Contexts define how the label is to be made sense of. There is only one instance of a given label in any one context. For instance, an early episode of the original Star Trek television series was called (labelled) 'The Enemy Within'. Within the context of this series, this label is unique and is a name. 'The Enemy Within' was also the label for an episode of Stargate SG-1 and for a comic book series related to the Terminator movies. The label 'The Enemy Within' is only helpful in determining precisely what the object is when we know its context.

The process of creating identifiers can be broken down into the following steps:

  1. A label is created. This can happen in a number of ways, automated or not. The data provider can suggest a meaningful label for the identifier; the label can be generated through an arbitrary process (e.g. timestamp, random number generator, creative effort); or the label can be derived from some properties of the thing being identified.
  2. A name is created. For identifiers being managed by an identifier management system, the context for names is provided by the identifier system, so creating a name involves registering the label as a potential identifier in the identifier system. This helps prevent collisions, in case the label is already used for something else in the identifier system. At this stage, metadata about the name can also be registered (who registered the name, and when).
  3. An identifier is created. Because an identifier is merely the association of a name with a thing, creating an identifier is equivalent to making that association explicit. For some identifiers, the association is implicit in the label, because it is meaningful; e.g. when used strictly as a locator, a URL label describes a network path: whatever is at that path is (or describes) the thing identified by the URL. For other identifiers, the association must be recorded explicitly in the identifier system, through information about what thing is being identified. For web accessible resources, the association is often a mapping between the identifier's name and a URL. This is typically the kind of information used in resolution.

While these three identifier creation actions often happen simultaneously, they may be separated to deal with constraints on the workflows around the data itself. For example, an organisation may choose to register a block of names in advance. Or a data provider may embed an identifier in a data object when creating it (which is good practice for persistence), but only later provide an association (e.g. a URL) to register with the identifier, when the object is actually published. In that case, the provider would need to register the name first; create the object with the embedded name; publish the object online to get a URL; and only then register the URL against the name, creating the identifier proper.
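The separation of these steps can be sketched in code. The following is a minimal illustration only, assuming a hypothetical in-memory identifier system: the class and method names (IdentifierRegistry, register_name, bind) and the example URL are invented for this sketch and do not describe the ANDS service or the Handle System API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class NameRecord:
    """A registered name; it becomes an identifier once a target is bound."""
    label: str
    registered_by: str
    registered_at: datetime
    target_url: Optional[str] = None  # None until the association is made


class IdentifierRegistry:
    """Hypothetical identifier system; it provides the context for names."""

    def __init__(self) -> None:
        self._records: dict[str, NameRecord] = {}

    def register_name(self, label: str, registered_by: str) -> NameRecord:
        # Step 2: registering the label in this context prevents collisions.
        if label in self._records:
            raise ValueError(f"label {label!r} is already registered")
        record = NameRecord(label, registered_by, datetime.now(timezone.utc))
        self._records[label] = record
        return record

    def bind(self, label: str, target_url: str) -> NameRecord:
        # Step 3: make the association between the name and the thing explicit.
        record = self._records[label]
        record.target_url = target_url
        return record

    def resolve(self, label: str) -> Optional[str]:
        return self._records[label].target_url


# Deferred workflow: reserve the name, publish the object, then bind the URL.
registry = IdentifierRegistry()
registry.register_name("15", registered_by="data-provider")
# ... the object is created with the name embedded, then published online ...
registry.bind("15", "http://www.example.com/data/15")
```

Keeping register_name and bind as separate operations is what allows a block of names to be reserved in advance.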

1.1.3 Updating identifiers

The information registered with an identifier can continue to be updated after the identifier is made available; identifiers relying on indirection (discussed below) periodically update the URL associated with the identifier, without updating the identifier name. Keeping the information about the identifier up to date is a key part of 'managing the identifier'.

Some updates are constrained by policy: the identity of who created the identifier cannot be changed without falsifying the record, and the label of an identifier cannot be changed without causing confusion. Even if the URL for an identifier changes, a persistent identifier should not end up associated with a completely different object: this defeats the purpose of having the identifier be persistent.

1.1.4 Publishing identifiers

As with data objects, publishing an identifier means that the identifier is made available to users who are not already managing it. Publishing identifiers involves crossing the curation boundary. The notion of a curation boundary is explained in the diagram below and is useful in understanding how identifiers interact with end users: it defines what it means to publish data.

The curation boundary model defines publication to be when the data is exposed to people not involved in managing the data, i.e. when it crosses a curation boundary. For example, a research collaboration can have geographically widespread access to a resource, and edit it frequently as part of their work. Once a copy of the resource is available outside that group, to users with read-only access, it is expected to be reasonably stable, and straightforward to locate. As part of its stability and to establish trust, any modifications to the resource should be documented, as change metadata. The resource is then published. (This compares to the distinction between alpha releases of software and official software distributions.)

A persistent identifier makes more sense for data which has crossed the curation boundary (is publicly cited by a large number of people, is stable, and will change network location only as a well-defined object). It may be less urgent for data still in flux and only accessed by a small number of active users. Consequently, the responsibilities of the identifier manager are greater once the object crosses the curation boundary.

This may not mean the identifier is available to the general public; any read-only access to an identifier counts as publishing it, even if it is subject to authorisation.

Since an identifier is a complex object, different aspects of an identifier could be published at different times. A user may know what the name of the identifier is, but not be authorised to do anything with that name online; a user may be able to find out when the identifier was created, but not be authorised to find out who created it.

NOTE: When we speak of publishing an identifier, however, we usually mean authorising users to resolve the identifier (as we define it below).

Publishing an identifier is normally synchronised with publishing the data it identifies. However, there are circumstances when they are published separately.

For example, if the data object is under embargo, it cannot be retrieved through the identifier; but the identifier name can be made public in advance, e.g. in a paper under review (so that it does not have to be changed or added in when the object is published). In that case, the identifier may not yet be allowed to resolve (the URL it would redirect to is not public), and the identifier resolution has not yet been published. Alternatively, it may resolve to limited information about the object, as opposed to downloading the object itself.

1.1.5 Using identifiers online

Ways in which identifiers can be used online are discussed below. If the identifier is persistent, then its manager needs to keep the information registered with the identifier up to date; that keeps its online use accurate.

1.1.6 Archiving identifiers

It may become impossible or impractical to update an identifier. In that case, an identifier can be archived. This can mean freezing the identifier information, so there is no longer any expectation that it will be updated; users should also be warned that the identifier may be out of date. Alternatively, aspects of the identifier may be withdrawn from public access, particularly the ability to resolve it. (Note that the name of the identifier cannot be withdrawn from public access, because users already know that the name exists.)

Archiving an identifier may not synchronise with archiving (or deleting) the thing it identifies. Persistent identifiers are expected to outlive the objects they identify, for historical use. Even after a data object has been deleted, scholarly literature or the Web may continue to point to the object through an identifier; the identifier can continue to be useful by giving information on what the object used to be. An identifier may also need to be archived if the object continues to exist but the identifier can no longer be kept up to date, e.g. because the object is now managed by some other party.

1.1.7 Deleting an identifier

An identifier can be deleted by removing the record of the identifier from the identifier system. This does not 'destroy' the identifier: anyone who knows that the name used to identify something has a mental record of the identifier. But deleting it does let the identifier name be reused to identify something else on the same system. If an identifier is persistent, it is only ever expected to identify one thing, so it cannot be reused. This means persistent identifiers should not be deleted from identifier systems and identifiers should never be reused, even if they will no longer be available publicly.
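One way to enforce the 'never reuse' rule is to withdraw records rather than delete them. The sketch below is illustrative only: PersistentLabelStore and its methods are hypothetical names, and a real identifier system would keep far richer records.

```python
class PersistentLabelStore:
    """Hypothetical store in which labels can be withdrawn but never reused."""

    def __init__(self) -> None:
        self._active: dict[str, str] = {}     # label -> current target
        self._withdrawn: dict[str, str] = {}  # label -> last known target

    def create(self, label: str, target: str) -> None:
        # A label that was ever used stays reserved, even after withdrawal.
        if label in self._active or label in self._withdrawn:
            raise ValueError(f"label {label!r} has already been used")
        self._active[label] = target

    def withdraw(self, label: str) -> None:
        # The record leaves public resolution but is kept as a tombstone.
        self._withdrawn[label] = self._active.pop(label)

    def is_resolvable(self, label: str) -> bool:
        return label in self._active

    def last_known_target(self, label: str) -> str:
        return self._active.get(label) or self._withdrawn[label]
```

Because the withdrawn record is retained, the system can still report what the label used to identify, which supports the historical use described in 1.1.6.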

1.2 Resolution and retrieval

An identifier can be used to name a thing. In order for the audience to be able to understand the identifier, they must already have a shared understanding of what is being named.

1.2.1 Resolution

If there is no shared understanding of the identifier with its audience, a resolution service is required, which maps identifiers onto things. In the broadest sense, resolving an identifier is getting information about the thing identified, to help distinguish it from other things. For example, a person's name could resolve to a listing of various identifying characteristics (like date of birth), which can be used to distinguish that person from all others. A name for a publication could resolve to its bibliographic citation, which can be used to distinguish that publication from all others.

1.2.2 Offline and online resolution

Resolution does not require that the identifier be a digital object itself, so offline identifiers (such as personal names) can be resolved too, such as by consulting a reference book. However, when the identifier is a digital object online, resolution generally means an online service which returns metadata about the object named. This metadata distinguishes the object from all others.

Similarly, the resolution of an identifier can take place online, while the thing being identified is offline: digital identifiers are not restricted to identifying online content. For instance, 'http://person.example.com/johnston/fred' can be resolved to a page listing Fred's contact details and description. The identifier is an online URL, and the page accessed is online metadata about Fred, but Fred himself is not an online object: Fred is not his website. However, the most common case is that the identifier, the resolution and the thing being identified are all online.

Uniform Resource Locators (URLs), which are usually web page addresses, are one type of Uniform Resource Identifier (URI). Using URLs to identify offline objects is problematic. As noted above, 'Fred is not his website': an online identifier for a person (or any offline entity) should not be confused with its online representation. This confusion has been longstanding with URLs, because they traditionally conflated resolution and retrieval, so that 'http://person.example.com/johnston/fred' could only identify a web accessible resource, not Fred himself. To deal with this, a URI is now allowed to resolve to an online resource without that resource being exactly what the URI identifies. The URI is then an abstract identifier, resolving to metadata, rather than a locator. So 'http://person.example.com/johnston/fred' can be made to identify Fred, though it downloads a web page about Fred. The distinction is made with different HTTP status codes, or by attaching #-fragments to URIs.

1.2.3 Retrieval

In general, we expect that clicking a URL will let us download the object itself. However, two distinct actions are taking place: getting distinguishing information about a thing is resolution, whereas getting to the thing itself (or a representation of it) is retrieval. The two activities are typically bundled together when you click on a URL in a browser, which is useful; but they can be logically separated out if necessary.

For example, consider a URL: 'http://www.example.com/paper.pdf'. When requested, the website could resolve the identifier to locate the paper being requested, and immediately download it to the user as a PDF, combining resolution and retrieval. Or it could resolve it to a splash page (created in PDF) about the paper, which provides bibliographic data as well as a link to the paper for download. Both options are valid and widespread on the web at present.

One benefit of separating resolution and retrieval is that if the object goes offline, clicking the URL can instead show information about what the object used to be. For example, the URL 'http://arxiv.org/abs/gr-qc/0609101' provides resolution of a scientific paper. However, it no longer provides a link to the paper for retrieval, as the paper was later found to have plagiarised other work.

1.2.4 Multiple resolution

Once retrieval and resolution are decoupled, multiple resolution becomes possible: an identifier can resolve to a page with several kinds of links, such as downloads from multiple locations or in multiple formats, as well as links to further information or services, such as purchasing the item in hard copy. An intelligent resolver can use information about the user and their context to assist this process, offering the right download for the user's operating system or browser, for instance. This is already common practice for open source software and shareware.

An intelligent resolver can use information about the user and their context to automatically select a mirror copy to deliver; that is an Appropriate Copy service. A resolver can go further and provide an Appropriate Version service, selecting a download based on language, file format, or accessibility format.
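As a rough sketch of how such a resolver might choose among copies, the following ranks candidates by format match first and mirror region second. The Copy fields, the ranking order and the example URLs are assumptions made for this illustration; real Appropriate Copy services use richer context.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Copy:
    url: str
    region: str       # where the mirror is hosted (hypothetical field)
    file_format: str  # e.g. "pdf" or "html" (hypothetical field)


def select_copy(copies: Sequence[Copy], user_region: str,
                preferred_format: str) -> Copy:
    """Prefer a copy in the requested format, then one in the user's region."""
    if not copies:
        raise ValueError("no copies are registered for this identifier")

    def score(copy: Copy) -> tuple:
        # Tuples compare element by element, so format outranks region.
        return (copy.file_format == preferred_format,
                copy.region == user_region)

    return max(copies, key=score)


copies = [
    Copy("http://au.example.org/paper.pdf", "AU", "pdf"),
    Copy("http://us.example.org/paper.pdf", "US", "pdf"),
    Copy("http://us.example.org/paper.html", "US", "html"),
]
chosen = select_copy(copies, user_region="US", preferred_format="pdf")
```

Here the format check plays the role of an Appropriate Version choice and the region check that of an Appropriate Copy choice.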

1.3 Value-added services

In addition to the core services listed above, value-added services can be introduced to satisfy the requirements of particular communities. (Arguably multiple resolution is also a core service, but intelligent resolvers providing appropriate copy and appropriate version resolution are certainly value-added.)

1.3.1 Guaranteeing persistence

To guarantee the persistence of identifiers, two types of services to verify identifiers are required. A link rot check service verifies that any URLs resolved to are still live. An association check service ensures that those URLs point to the correct resources, and that the resources at those network locations have not been replaced by something else. These services are important not only for identifier managers but also for end users, to establish trust in the identifier system.
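The two checks can be illustrated with a small sketch. It assumes, purely for illustration, that a SHA-256 checksum of the resource was recorded with the identifier; a checksum only suits resources whose bytes are never meant to change. The fetch function is passed in as a parameter so that the example runs without network access.

```python
import hashlib
from typing import Callable, Optional

# A fetcher returns the resource bytes, or None if the URL is no longer live.
Fetcher = Callable[[str], Optional[bytes]]


def check_identifier(url: str, expected_sha256: str, fetch: Fetcher) -> str:
    """Return 'ok', 'link-rot' or 'association-changed' for one identifier."""
    content = fetch(url)
    if content is None:
        return "link-rot"             # link rot check failed
    if hashlib.sha256(content).hexdigest() != expected_sha256:
        return "association-changed"  # something else now lives at the URL
    return "ok"


# Hypothetical stand-in for the network: a dictionary of URL -> content.
pages = {"http://www.example.com/a": b"dataset v1",
         "http://www.example.com/b": b"an unrelated page"}
digest = hashlib.sha256(b"dataset v1").hexdigest()
status = check_identifier("http://www.example.com/b", digest, pages.get)
```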

1.3.2 Archival resolution

Any online resource can end up no longer being actively maintained. Since an identifier is an online resource, it too can end up no longer being kept up to date. This is typically due to institutional changes, where the maintainer of the item is somehow not in communication with the identifier manager. As a result, the identifier manager can no longer find out the current URL of the resource, for the identifier to resolve to.

One way of dealing with this is by having the identifier resolve to the last known URL of the resource. Another is to provide contact details for the resource's current manager, or the last known identifier manager, so that users interested in accessing the resource can contact them directly; this could apply if the identifier manager is no longer active, and no one else has taken the identifiers over. These are all instances of archival resolution services: the identifier is no longer being maintained, so it can be considered archived.

1.3.3 Relationship resolution

Capturing the relationships between various entities can also be treated as a persistent identifier service, mapping between the identifiers of the related entities. This is especially useful if the relation is between abstract entities where citing a URL associated with a specific copy of the resource would be misleading. For example, given the identifier of some intellectual property, a derivative work service can return the identifiers of works known to be derived from it. The relationship is not merely between particular instances of those works; it involves any file containing that intellectual property, as represented through abstract identifiers rather than concrete file locations. Relationship services can also include versioning services, which return a particular version of a file given an identifier encompassing all versions of the file.

1.3.4 Annotation service

An annotation service can attach metadata to a resource or parts of a resource, wherever it happens to be stored, through its persistent identifier. This allows annotations to be attached reliably by a third party, and to be accessible in context over the long term, without depending on updates about where the annotated resource has since moved. The persistence of those annotations is more secure if the targets are themselves identified through persistent identifiers, especially since the party creating the annotation may have no control over how the location of the target resource may change.

1.3.5 Information hierarchies

Relationship services can also be used to navigate through information models for a domain, which include abstract entities. The FRBR model, for instance, is used in libraries to relate copies, formats, editions, versions and adaptations of literary works; an entire such structure can be navigated through identifiers for the various levels of abstraction, until a concrete entity (a physical book or file) is reached.

1.3.6 Citation tracking

Finally, if the persistent identifier is used to cite a research output, citation tracking can be treated as an identifier service, scanning for instances where the identifier has been mentioned. Crossref's Cited-By Linking service, which relies on tracking DOI persistent identifiers, is an example of such a service. Using persistent identifiers has the advantage of avoiding specific file locations, so use of a research output can be tracked through its lifecycle, wherever it happens to be stored. On the other hand, there can always be more than one identifier used to cite a resource (including the current local URL, if it is exposed to users), so a citation tracking service is not guaranteed to pick up all existing citations.

2. The Handle System

2.1 Introduction

Handle technology, developed by the Corporation for National Research Initiatives, has been widely deployed in the repository community. This section gives an overview of the Handle system and how it is used by the ANDS Identify My Data product.

NOTE: The ANDS Identify My Data product is a Persistent Identifier Service (PIDS). The underlying service functionality is based on the Handle system. It has a machine-to-machine interface offered as a web service, and an online human-facing interface offered through a web browser.

2.2 Handles and namespaces

A handle consists of two parts: a naming authority and a label unique within that naming authority. The two parts are separated by a slash ('/'). Naming authorities themselves can consist of different parts, separated by dots ('.'); unlike DNS, this does not imply a hierarchical structure of authorities. Any other UTF-8 characters are technically permitted in both the names of naming authorities and local names, but in practice, naming authorities tend to be numeric.

For example, '10.1045/january99-bearman' is a handle under the '10.1045' naming authority.

Handle technology allows specific handle names to be requested and allocated.

Sometimes people cite URLs like 'http://hdl.handle.net/102.100.100/15' calling it a 'handle'. It is important to note that the handle here is only '102.100.100/15'. Also, note that hdl.handle.net is not, strictly speaking, a 'handle server' but rather, a 'handle proxy server'. See §2.5 'Handle Proxy Server' for details.
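The distinction between a handle and a proxy URL can be made concrete with a short parsing sketch. The function and type names are hypothetical, and the sketch makes simplifying assumptions: it splits on the first slash only and does not decode percent-encoded characters.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class Handle(NamedTuple):
    naming_authority: str
    local_name: str


def parse_handle(text: str) -> Handle:
    """Accept a bare handle or a proxy URL and return the handle's two parts."""
    if text.startswith(("http://", "https://")):
        # For a proxy URL, the handle is the path after the host name.
        text = urlparse(text).path.lstrip("/")
    # Only the first slash separates naming authority from local name.
    authority, separator, local_name = text.partition("/")
    if not separator or not authority or not local_name:
        raise ValueError(f"not a handle: {text!r}")
    return Handle(authority, local_name)


bare = parse_handle("10.1045/january99-bearman")
cited = parse_handle("http://hdl.handle.net/102.100.100/15")
```

Both calls return a Handle; for the cited URL, the host name hdl.handle.net is discarded because it names the proxy server, not the handle.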

NOTE: ANDS Identify My Data allocates handle names on behalf of the user.

2.2.1 Handles and namespaces

The ANDS Handle namespace is 102.100.100. This is made up of '102' (Australia) dot '100' (e-research) dot '100' (ANDS).

Handles allocated by ANDS are numerical values in sequence within this namespace. ANDS PIDS handles therefore look like '102.100.100/15'. A resolvable URL for an ANDS PIDS handle looks like 'http://hdl.handle.net/102.100.100/15'.

2.3 Handle Server

A handle server simply associates metadata with a handle, and returns that metadata when requested by a call to the Handle Service. The kinds of metadata associated include URLs, text descriptions and ownership information.

The handle server listens (usually on port 2641) for requests made using the 'Handle System Protocol'. These requests cover handle administration (creating, updating and deleting handles), handle queries (returning metadata associated with a handle), and authentication. It is not directly usable by end users.

NOTE: All interactions with the ANDS Handle server will either take place via a proxy server ('resolver') such as hdl.handle.net, or through the ANDS Persistent Identifier Service, discussed below.

2.4 Handle Client

Handle Servers do not come with a web interface. Therefore a specialised application is required to enable users to interact with them. Two such applications (one command-line, one Java GUI) are included with the Handle Server software, and libraries exist to allow other applications to be written.

NOTE: ANDS offers a set of web services and a browser-based client interacting with The Handle System (ANDS PIDS, ANDS Self-Service Identifiers). These are discussed further below.

2.5 Handle Proxy Server

As a convenience to users, The Handle System comes with a simple web server, called the Handle Proxy Server (also known as 'the HTTP interface' or 'the resolver'). This is a separate piece of software from the handle server itself, but as it provides a user interface to the handle server, the two are frequently confused.

The proxy server provides a resolution service, taking a handle and returning an HTTP redirect if there is a URL stored in the metadata record for that handle.

For example, suppose there is a handle server running at hdl.ands.org.au, which manages handles under the ANDS PIDS handle naming authority '102.100.100'. Further suppose that the ANDS PIDS handle 102.100.100/10 includes the URL 'http://tardis.edu.au/experiment/view/10' in its metadata record. The server hdl.handle.net is running a handle proxy server, responding to requests for URLs that begin with 'http://hdl.handle.net/'.

  1. A user requests the URL 'http://hdl.handle.net/102.100.100/10' through their browser.
  2. The proxy server interprets this as a request for the handle '102.100.100/10'.
  3. The proxy server contacts the Global Handle Registry (see §2.6) to find the handle server for the naming authority '102.100.100'.
  4. The Global Handle Registry returns the handle server's address: 'hdl.ands.org.au'.
  5. The proxy server contacts the handle server at hdl.ands.org.au and requests all metadata records associated with the handle '102.100.100/10'.
  6. The response contains a URL ('http://tardis.edu.au/experiment/view/10'), so the proxy server returns an HTTP redirect to that URL.
  7. The user's browser follows the redirect, eventually displaying 'http://tardis.edu.au/experiment/view/10'.

This is the behaviour if the associated metadata record contains exactly one URL. If it contains no URLs, instead of a redirect, the proxy server serves an HTML page displaying the contents of all associated metadata records.

If the associated metadata record contains more than one URL, the proxy server serves a redirect to the first URL encountered in the handle record.
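The proxy behaviour described above can be simulated in a few lines. This is a sketch, not the real Handle software: the two dictionaries are hypothetical stand-ins for the Global Handle Registry and a handle server, the 'DESC' value and handle '102.100.100/11' are invented for illustration, and the 302 status code is an assumption (the description above specifies only 'an HTTP redirect').

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the Global Handle Registry and one handle server.
GLOBAL_HANDLE_REGISTRY = {"102.100.100": "hdl.ands.org.au"}
HANDLE_SERVERS = {
    "hdl.ands.org.au": {
        "102.100.100/10": [("URL", "http://tardis.edu.au/experiment/view/10")],
        "102.100.100/11": [("DESC", "A record with no URL")],
    }
}


@dataclass
class ProxyResponse:
    status: int
    location: Optional[str] = None  # set for redirects
    body: Optional[str] = None      # set when the record is displayed


def proxy_resolve(request_path: str) -> ProxyResponse:
    handle = request_path.lstrip("/")                  # step 2
    naming_authority = handle.split("/", 1)[0]
    server = GLOBAL_HANDLE_REGISTRY[naming_authority]  # steps 3 and 4
    values = HANDLE_SERVERS[server][handle]            # step 5
    urls = [value for value_type, value in values if value_type == "URL"]
    if urls:
        # One or more URLs: redirect to the first one encountered (step 6).
        return ProxyResponse(status=302, location=urls[0])
    # No URL: show the contents of the record instead of redirecting.
    listing = "\n".join(f"{value_type}: {value}" for value_type, value in values)
    return ProxyResponse(status=200, body=listing)


redirect = proxy_resolve("/102.100.100/10")
record_page = proxy_resolve("/102.100.100/11")
```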

NOTE: ANDS does not provide a Handle proxy server. Instead, ANDS utilises hdl.handle.net as the preferred resolver for persistent identifiers. ANDS PIDs can therefore be cited in the format: 'http://hdl.handle.net/102.100.100/10'.

2.6 Global Handle Registry

In addition to individual Handle servers and Handle proxy servers, there is a global registry of Handle servers, known as the Global Handle Registry (GHR). The registry maps namespaces (such as 123.456) to IP addresses, so that any user can use any Handle server or Handle Proxy Server to look up any Handle. Every Handle server thus knows how to contact the GHR, which tells it how to contact the Handle server corresponding to a given naming authority.

This registry is hosted by CNRI, which charges an annual fee for each Handle server registered. CNRI also manages a central GHR in Virginia, USA and two GHR mirrors located internationally, with plans for further expansion.

3. Policy

In order for persistence to be realised for the various aspects of an identifier, a robust policy infrastructure needs to be in place. Identifiers are created in different ways for different purposes and interact with their environment in complex ways. Consequently, policies can be applied to a range of possible identifier uses. This section provides guidance on a range of possible policy considerations, with some recommended practices for realising persistence. The policy considerations are broken up into questions on:

  • what identifier labels should look like;
  • how identifier systems should be managed;
  • how identifier services should be managed; and
  • how identifiers should be assigned to things.

All of these need to be worked out before considering how identifiers can best be persisted.

Increasingly, the long-term maintenance of research data is governed explicitly by data management plans, which express a negotiated understanding between the researcher and the institution maintaining the data and publishing it online. Persistent identifiers are essential to the long-term accessibility of resources, which are not restricted to appearing at only one network location or institution.

3.1 Label Policy

3.1.1 Introduction

This section explains several issues in label policy, such as whether labels should be meaningful and the implications of format choices.

3.1.2 Meaningfulness of labels

One of the first policy choices that managers of persistent identifiers are confronted with is whether identifier labels should be meaningful. If the label is meaningful (that is, if a user can infer things about what is being identified from the label), then the identifier may be easier for people to remember, to enter without error, and to communicate to others.

However, meaningful labels are usually based on attributes of the things identified that are less likely to persist than the thing itself. The network address for a resource is one meaningful label to identify the resource by, and URLs exploit that meaning to do resolution. But resources move, so their network addresses change. Other attributes, such as title, institutional owner, or subject matter of the resource, are also subject to change.

Updating the label to match the current semantics of the object (i.e. renaming the object) is certainly possible, but results immediately in a broken link or its equivalent. And because the identifier can end up cited by anyone once it is published, it is impossible to update ('patch') all instances of the identifier found online. Some organisations, notably standards bodies, decide to freeze the label anyway in that case, but this produces the undesirable result of a 'meaningful' label with an obsolete meaning.

Consequently, widespread practice is to use an arbitrary label, either generated randomly, or by using an attribute which cannot change and is not particularly revealing, such as the item's creation timestamp. That way, any changes in the thing or its status do not affect the persistence of the label.

It is possible to preserve meaning in the label, but to obfuscate that meaning. This middle-ground approach may assist in error recovery, while avoiding the pitfalls of transparent meaning. For example, a timestamp can be encrypted or encoded as an alphanumeric string.

Unless the attributes used to give meaningful labels can be strongly guaranteed never to change, meaningful labels generally pose an unacceptable risk to persistence, and arbitrary labels are commonplace for persistent identifiers. Common approaches are sequential numbers and timestamps; both are still somewhat meaningful, but the meaning is not usually revealing, and can in any case be obfuscated.

3.1.3 Form of labels

3.1.3.1 URL safety

Labels used within identifiers need to be URL-safe, since identifiers will almost always end up used in URLs. They should therefore not contain characters which need encoding to be embedded safely in URLs, such as '&' or space: such conversion can confuse users as to whether the encoded or the unencoded label is the 'real' label. For example, 'a&b', when URL-encoded, becomes 'a%26b'. URL normalisation and URL encoding are intended to deal with such issues over HTTP, but they do not apply to all contexts in which URIs appear and are still traps for the unwary. Handle identifiers can present a risk, as a wide range of characters are permitted.

NOTE: The ANDS Identify My Data product always generates URL-safe identifiers.
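A simple test for URL safety is to check whether percent-encoding changes the label. The sketch below uses Python's standard urllib.parse.quote; the function name is_url_safe is hypothetical, and treating '/' as unsafe inside a label is a deliberately strict assumption of this sketch.

```python
from urllib.parse import quote


def is_url_safe(label: str) -> bool:
    """True if percent-encoding would leave the label unchanged."""
    # safe="" makes quote() encode every reserved character, including '/'.
    return bool(label) and quote(label, safe="") == label


encoded = quote("a&b", safe="")  # the example from the text: 'a%26b'
```

With this check, numeric labels such as '15' pass, while 'a&b' and labels containing spaces or slashes are rejected.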

3.1.3.2 Variant forms

More generally, labels with multiple possible variant forms should be avoided, as users (or systems) risk assuming that the variants are distinct after all. For example, the ARK identifier system considers the labels '712-4' and '7124' to be equivalent, since it strips out hyphens. However, other URL-based services will treat them as distinct, and human users will typically do likewise. So citation tracking of the identifier might fail; assertions made using the two forms of the identifier might not be applied to the same thing; migrating identifiers to different identifier systems might artificially differentiate the identifiers; and indexing may duplicate entries for the resource. Conversely, case sensitivity should be avoided, as should visually confusable characters (1 I l, 0 O), as humans risk failing to distinguish them.

NOTE: The ANDS Identify My Data product does not generate labels which can be confused within ASCII, as it uses numeric labels exclusively.

3.1.3.3 Punctuation

Labels will likely be delimited by punctuation, both when cited in running text, and when embedded within URLs or other identifiers. For that reason, punctuation should be avoided in labels, if there is a risk of confusion about where the label ends. A label with a trailing comma, such as 'fred,',  can  be  confusing when cited in text (readers will assume the label excludes the comma). Likewise, a trailing slash in a label risks being mistaken for a delimiter, if that label is embedded in a URL. For example, the following two URLs would normally be considered equivalent: 'http://www.example.com/resolve/992'  and  'http://www.example.com/resolve/992/'.

NOTE: The ANDS Identify My Data product does not use punctuation within labels.

3.1.3.4 Label length and format

If there is any prospect that identifiers will often be entered into a system manually, then labels should be short enough for human users to remember in their short-term memory (7±2 'chunks' of information, e.g. 7 characters or words); they should certainly be short enough to write down or  type  (20  chunks or less). On the other hand, the maximum label length should be large enough that label possibilities are not exhausted in the foreseeable future. So if millions of labels will be assigned for a context, the label size should not be restricted to just four characters.

If labels are arbitrarily generated, they should if possible be of uniform length, in order to simplify error checking. The label generation algorithm used should track previously used labels to avoid one name being used to identify two different things.

An effective operating practice is to use arbitrary, uncased alphanumeric labels, avoiding I and O, with a fixed length of between four and nine characters, depending on how many identifiers will ever need to be assigned. Such labels are already widely used in systems like TinyURL.
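The practice above can be sketched in Python. This is an illustration only: the class and function names are hypothetical, the set of used labels is held in memory (a real system would need persistent storage), and it does not describe how any particular service generates labels.

```python
import secrets
import string

# Uncased alphanumerics without I and O: 24 letters + 10 digits = 34 symbols.
ALPHABET = "".join(
    c for c in string.ascii_uppercase + string.digits if c not in "IO"
)


def capacity(length: int) -> int:
    """Number of distinct fixed-length labels the alphabet allows."""
    return len(ALPHABET) ** length


class LabelMinter:
    """Mint arbitrary fixed-length labels, never issuing the same one twice."""

    def __init__(self, length: int) -> None:
        self.length = length
        self._used: set[str] = set()  # in-memory only, for illustration

    def mint(self) -> str:
        if len(self._used) >= capacity(self.length):
            raise RuntimeError("label space exhausted")
        while True:
            label = "".join(
                secrets.choice(ALPHABET) for _ in range(self.length)
            )
            if label not in self._used:
                self._used.add(label)
                return label


print(capacity(4))  # 1336336
print(LabelMinter(6).mint())
```

With this alphabet, four characters allow only about 1.3 million labels, which is why a context expecting millions of identifiers needs a longer label.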

NOTE: The ANDS Identify My Data product uses short numeric labels.

3.2 Identifier management

3.2.1 Ownership of identifier systems

Management of identifiers can be separated from management of data. It is important to make decisions on identifier policy based on an understanding of the consequences of this separation.

3.2.1.1 Updating identifier resolution

The identifier manager undertakes to the end user to maintain the persistent identifier. The identifier manager publishes the identifier, and so has institutional responsibility for it. The identifier manager is often the same as the identifier provider, who provides the services  for  managing  and accessing the identifier — so the identifier manager sets up the identifier management system, and also issues updates to the system. In ANDS' case, they are distinct: for the ANDS Identify My Data product, ANDS takes on the identifier provider role and the product consumers  assume  the  identifier manager role.

To maintain the identifier, the identifier manager has to coordinate with the data manager, who is responsible for keeping the resource identified online. The data manager in turn is publishing data on behalf of the data provider, who is typically the researcher.

If the data manager moves the resource to a new address, or takes the object off-line, the identifier manager has to be aware of this and update the identifier accordingly. The identifier manager and the data manager are also not necessarily the same person: the identifier may be managed by a different  party  from  the data. For example, a department has one contact point for issuing updates to ANDS, but the updates originate in several separate labs in the same department: the labs have their own data managers, who are coordinating with the department's identifier manager, to communicate all the needed  updates  to  the identifier provider.

Ensuring that the identifier is updated smoothly requires coordination between the identifier manager and the data manager. The more separated the identifier manager is from the data manager — especially if they belong to different institutional structures — the harder such coordination  is  to  realise. This has been called the 'Our Stuff vs. Their Stuff' problem: it is harder to persist identifiers for data that is under some other institution's control ('Their Stuff'), whereas managers working under a single authority ('Our Stuff') can more easily co-ordinate with colleagues and put  the  necessary  procedures in place.

The data provider is also involved in working out the best policy structure for identifiers. The data provider has the best notion of how the identifier will be used by the user community, of how long the data will be useful, and which parts of the data identifiers should point to. (This involves information  modelling:  see  the section on policies below.) The identifier manager is responsible to the data provider to keep their data accessible, as much as they are to the end users.

Under this two-tiered resolution arrangement the identifier resolution service must be updated promptly when the URL for the data changes. If the data manager is also the identifier manager, this is straightforward. However, when identifiers are managed externally (by ANDS, for example), it is impractical  for identifier managers to detect and respond to such changes in the data identified. ANDS won't know that you have moved your data. It is your responsibility as a data manager to update the resolution for your identifiers promptly.

3.2.1.2 External identifier policy

By using an external identifier service, you are also constrained by that service's identifier policies and you have less scope to set your own policies. If the identifier service dictates a certain label format or amount of authority metadata for instance, you cannot set your own policy contradicting  that.  The  ANDS Identify My Data product, for example, generates all identifier labels, eliminating the need for the creation or implementation of any policies on identifier labels by the data manager.

You will also have less control over the kinds of identifier services you can provide, because those services rely on the information provided by the external service. (These constraints also hold if you host your own identifiers, but still share your system with other institutions.) Furthermore, you  cannot  brand  identifiers as your own, a useful restriction for ensuring persistence despite ownership changes. On the other hand, if you run your own identifier infrastructure, you are burdened with the commitment of setting explicit policies, maintaining the system for reliability and performance, and  having  to  build up local expertise.

3.2.1.3 Benefits

A benefit of using a common identifier service is the isolation of identifier management from changes in data ownership. This means that if control of the data identified passes from one institution to the other, the identifier managed by a third party is not affected: the new data manager can establish  the  same  relationship with the identifier system as did the old data manager. However, identifier managers still do not have universal scope: a common ANDS identifier may deal with data moving from Victoria to Western Australia better than it will deal with data moving from Victoria to Germany as the  formation  of  a Handle prefix includes a country code.

3.2.1.4 ANDS identifier management policies

Although ANDS does not set restrictive policies about use of ANDS identifiers, these identifiers themselves must comply with the requirements of the Handle protocol. Additionally, users of the ANDS Identify My Data product can (and should) set some of their own policies. Identifiers do not become persistent simply by minting them.

3.2.2 Context management

When an organisation manages its own identifiers, it can also organise its identifiers into a hierarchy of contexts, akin to DNS subdomains. For example, a university's identifiers could possibly be broken up into library identifiers and researcher identifiers. Context hierarchies allow delegation  of  identifier  management, and profiling of identifier policies to different domains. But a subcontext should still conform to the embedding context (a university library identifier is still the university's identifier), so the identifiers still conform to a central, core policy profile.

NOTE: The ANDS Identify My Data product currently provides one namespace (identifier system context) for all its identifiers, regardless of who has requested them. That means that the identifiers of a single organisation are not visually distinguishable.

This means the service is likely to be more consistent and robust.

3.2.3 Authority metadata

For users to trust claims of identifier persistence, mechanisms are needed to allow those claims to be defensible. Users should be able to determine who is claiming the identifier is persistent; who is acting to keep it persistent (and who has done so in the past), how they are doing it, and how long  they  intend  to keep the identifier persistent.

At a minimum, identifier users should be able to recover, as publicly available metadata, when the identifier was last updated, and current contact details for the identifier manager. For ANDS PIDs, that means the party using the ANDS update services, rather than ANDS itself. If the identifier stops  working —  the  resolution becomes out of date — users can contact the identifier manager to alert them to the error, or to get more information on the current status of the resource.

Contact information should itself be reasonably persistent; e.g. the maintainer should be identified by role and not as an individual. Further authority metadata, such as who created the identifier when, what type of thing is being identified, and who has managed the identifier in the past, can also  be  included;  this can extend as far as maintaining logs on identifier operations.

Because authority metadata is used when things go wrong, its availability should not be reliant on external systems: failure to access an external system may be why things have gone wrong to begin with. Contact data should therefore be stored directly in the identifier record, rather than linked through  some  external  database.
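As a minimal sketch of these guidelines, the Python record below keeps the contact details, the last-updated time and a simple log inside the identifier record itself. The field names and example values are hypothetical; they do not reproduce the Handle record format or any ANDS schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IdentifierRecord:
    """Identifier record that carries its own authority metadata."""

    handle: str
    url: str
    manager_role: str      # a role, not a named individual
    manager_contact: str   # stored in the record, not in an external system
    last_updated: datetime
    history: list[tuple[datetime, str]] = field(default_factory=list)

    def update_url(self, new_url: str, when: datetime) -> None:
        """Log the previous resolution, then record the new one."""
        self.history.append((self.last_updated, self.url))
        self.url = new_url
        self.last_updated = when


record = IdentifierRecord(
    handle="102.100.100/15",
    url="http://www.example.com/data/15",
    manager_role="Research Data Manager",
    manager_contact="data-manager@example.com",
    last_updated=datetime(2010, 1, 1, tzinfo=timezone.utc),
)
record.update_url("http://www.example.com/archive/15", datetime.now(timezone.utc))
print(record.url, len(record.history))
```

Because the contact role and the log travel with the record, a user who finds the resolution out of date can still see who to contact and when the record last changed.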

3.3 Identifier services

3.3.1 Resolution guidelines

There is a longstanding conflation of resolution and retrieval in URIs, leading users to expect that they can perform one (and only one) kind of action online with identifiers (resolution). To address this expectation, online identifiers should at least provide resolution behaviour as a default. For example, Handle records should include a URL field.

The meaning of 'resolving' an identifier depends on the context. The meaning of a usable representation of the thing identified, to be delivered through retrieval, also depends on context. For that reason, the resolution behaviour of an identifier should be sensitive to context, or at least rich enough  not  to  rule out certain contexts. Different resolution services need to be exposed explicitly, if the user is to realise they are available.

3.3.1.1 Abstract and concrete resolution

Identifiers can exist for abstract entities, such as the concept of an academic work, rather than physical copies of the work. Identifiers for abstract entities can be resolved in a number of ways. An abstract resolution provides a description of the entity, such as a bibliographical citation. A concrete resolution provides a representation of the content (in effect, a retrieval).

Abstract resolution is more correct for abstract entities: the identifier for an abstract document identifies all versions of the document, not just the latest PDF version, or a browser preview. However, as people typically access identifiers to obtain viewable content, abstract resolution on its own  is  less  useful.

Which form of resolution you should provide depends on what users expect to do with the identifier. The common resolution practice in institutional repositories reflects this: identifiers resolve not directly to document or data retrieval, but instead to a splash page containing metadata about the  resource.  This  is a kind of abstract resolution: the splash page provides hyperlinks, clearly labelled as such, so that users can choose the most appropriate representation or service to access the resource.

3.3.1.2 Resolver persistence

If persistent identifiers are cited with an associated resolver service (that is, 'http://hdl.example.com/123.456' rather than just 'hdl:123.456'), users will reasonably expect that the resolver is as persistent as the identifier. That is, even if the underlying identifier record ('123.456') is persistent,  users  will  perceive it as broken if the URL they have stops working. Because the identifier manager is not always responsible for the identifier resolver, a resolver service should be selected with care, to ensure it is trustworthy.

If an identifier service is to be called with additional parameters in a URL query, e.g. to specify the particular representation to be retrieved, take care to delimit the identifier proper from the parameters, so that users do not assume the parameters are part of the identifier. Identifiers can be presented in different encoding schemes; URL encoding, for example, allows a URI to be embedded safely within another URI.
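One way to keep the identifier and the parameters apart is sketched below in Python. The resolver address and the 'format' parameter are hypothetical; the sketch assumes a resolver that accepts a percent-encoded identifier in the URL path and reads service parameters from the query string.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit

# Hypothetical resolver address, used only for illustration.
RESOLVER = "http://www.example.com/resolve/"


def build_request(identifier: str, **params: str) -> str:
    """Put the encoded identifier in the path and parameters in the query."""
    url = RESOLVER + quote(identifier, safe="")
    if params:
        url += "?" + urlencode(params)
    return url


def split_request(url: str) -> tuple[str, dict[str, list[str]]]:
    """Recover the identifier proper and the parameters from a request URL."""
    parts = urlsplit(url)
    prefix = urlsplit(RESOLVER).path
    identifier = unquote(parts.path[len(prefix):])
    return identifier, parse_qs(parts.query)


url = build_request("102.100.100/15", format="pdf")
print(url)                 # http://www.example.com/resolve/102.100.100%2F15?format=pdf
print(split_request(url))  # ('102.100.100/15', {'format': ['pdf']})
```

Encoding the identifier with no characters left unescaped means its own '/' cannot be mistaken for a path delimiter, and the '?' marks unambiguously where the parameters begin.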

3.3.2 Citation of Handles

There are different ways persistent identifiers can be cited; when publishing persistent identifiers you will need to make a trade off between persistence and usefulness. A Handle, such as an ANDS persistent identifier, can be presented as just a name (e.g. a URN) — which improves persistence  —  or  as a resolvable URI, which is more useful in a digital environment.

If the name is used, the context of the name — that is, the identifier system used — should be made explicit (for example 'Persistent Handle: 102.100.100/15').

While Handle does not have its own recognised URN prefix, 'hdl:' can be used informally (for example 'hdl:102.100.100/15'). The only standard way of presenting a Handle as a URN is in the Info-URI space, as 'info:hdl/' (for example 'info:hdl/102.100.100/15').

To present the identifier as resolvable, it is important to choose a resolver that users are confident will remain available for the long haul (for example 'http://hdl.handle.net/102.100.100/15').
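The citation forms discussed in this section can be generated, and reduced back to the bare Handle, with a short Python sketch. The function names are hypothetical; only the labelled, informal 'hdl:' and resolvable forms are covered.

```python
# Default Handle resolver named in the text.
RESOLVER = "http://hdl.handle.net/"


def citation_forms(handle: str) -> dict[str, str]:
    """Return the same Handle in three citation forms."""
    return {
        "name with explicit context": f"Persistent Handle: {handle}",
        "informal hdl: prefix": f"hdl:{handle}",
        "resolvable HTTP URI": f"{RESOLVER}{handle}",
    }


def bare_handle(citation: str) -> str:
    """Strip a known citation prefix to recover the Handle itself."""
    for prefix in ("Persistent Handle: ", "hdl:", RESOLVER):
        if citation.startswith(prefix):
            return citation[len(prefix):]
    return citation


for form in citation_forms("102.100.100/15").values():
    print(form, "->", bare_handle(form))
```

Reducing every form to the same bare Handle is what allows citations made in different styles to be matched to one another.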

NOTE: As ANDS PIDs are intended as part of the data fabric for Australian research, online resolvability is critical. ANDS recommends citing identifiers uniformly as HTTP URIs, using the default Handle resolver, e.g. as 'http://hdl.handle.net/102.100.100/10'.

3.4 Identifier association

Keeping identifiers persistent consumes resources, and should not be undertaken lightly. Data managers need to prioritise what to identify persistently in their domain. Those decisions depend in turn on an information model of the domain of objects that may potentially be identified: persistent identifiers  will  only  be assigned to a subset of those objects. Drawing up such an information model can help anticipate how identifiers are likely to be used and adjusting the information model can capture explicitly what the changes in those expectations are.

3.4.1 Information modelling

The information model should not be restricted to research data and research outputs, but should track all the entities that help contextualise and make sense of the data. For example, research data is organised by its subject matter, so the different subjects and samples that the data is about may  themselves  need  to be identifiable in the future. The same holds for the experimenters and institutions involved in the research, and the instruments, simulations, and workflows used to obtain the data.

The possibilities for persistent identification are not restricted to concrete instances of documents and files: information modelling can identify more abstract levels of resource description. These abstractions are more liberal in what is available as an acceptable representation for retrieval; so  they  are  potentially better candidates for persistent identification, because they are not dependent on the ongoing availability of a specific representation. Recurring abstractions that occur independently of domain include:

  • aggregations and disaggregations of other things;
  • different presentations of things (including different file formats and encodings), with identical content;
  • different versions of things (including revisions, drafts, transformations, and translations); these differ in their content in minor ways, but represent the same underlying thing;
  • abstract works (the 'same underlying things' that versions and presentations refer back to).

Identifying different levels of abstractions requires metadata to distinguish them and represent the relations between them. For example, if different versions of a file are identified, users of the identifiers for the versions should be able to recover files corresponding to those particular versions,  and  how  they are related.

3.4.2 Prioritising Persistence

An analysis of business processes is used to select which things to identify persistently. Although it can be difficult to anticipate what use an identifier will be put to once published, domain knowledge can provide some informed guesses. Such decisions depend on a variety of considerations, discussed  below.

3.4.2.1 Will the thing identified be persisted itself?

If something identified is destroyed, it may be important to keep identifying it for archival purposes; but there is a higher priority on persisting identifiers for things that are still online. For example it may be more important to identify versions of a file persistently if those versions are still  accessible,  than  if the old versions of the file have been overwritten. Whether deleted objects should have persistent identifiers at all depends on whether they are externally cited (see below).

3.4.2.2 Will it be published?

Will the thing identified be available outside the curation boundary?

A persistent identifier makes more sense for data which has crossed the curation boundary (is publicly cited by a large number of people, is stable, and will change network location only as a well-defined object). The responsibilities of the identifier manager are greater once the object crosses the  curation  boundary.

If various drafts of an object are created internally, but only the final version of the object is released, then there is less need for the previous drafts to be persistently identified: unreleased drafts moving location may be less disruptive than a released version moving location. It is also possible  to  publish  an identifier without making the resource itself publicly available, but that is less typical. There is an expectation that if the identifier is public, at some stage the thing identified will be public as well.

3.4.2.3 Will it move servers as part of its normal workflow?

Even if it is not yet published, research data often moves between servers, e.g. from the lab to a collaboration server, or to a researcher's private space. If data needs to be accessed consistently throughout such moves, it may make sense to identify it persistently for the duration, and to maintain  resolution  using  that identifier throughout. This is still usually lower priority than having persistent identifiers in place for data that is already published.

3.4.2.4 Will it be cited externally?

As with the curation boundary, if a third party links to the thing identified, it is important to make sure the link keeps working. Having the link break will be very disruptive: it is hard to predict who will link to the thing, and impractical to warn everyone that its location will change. Conversely,  if  the  thing will only be linked to internally, then persistent identifiers may not be required: when changes occur, all concerned parties can be alerted directly.

3.4.2.5 Does the information model matter?

If there are business processes that depend on specific versions of a file, then those different versions should be identified differently. But if the version is irrelevant to any real business processes using the file, or if only the latest version is ever called on, then there is no business motivation  to  identify  them separately.

Likewise, aggregations and disaggregations of objects are open-ended in number, but it is only worth identifying a particular aggregation if there is a business process that will actually make use of it. That means that the thing persistently identified should make conceptual sense to the people and processes interacting with it. The granularity at which objects are identified also depends on how users will interact with the objects. In turn, this means that the thing identified should be easily described through metadata: if it is difficult to describe the thing, it probably does not make enough sense as a concept to identify persistently.

3.4.2.6 Is the thing identified stable?

Persistently identifying something raises the expectation that the thing itself is not only accessible over a long period, but is a well-managed and well-defined object. This means that any changes to the thing should, where practical, be well-documented and accountable: if what is being identified  is  'the  same thing' over time, users should be able to work out why the thing does not look identical to what they may have accessed last month.

3.4.2.7 Is the thing under the control of the identifier authority?

Persistence implies accountability for any changes in the thing being identified — notably its online location. Such accountability is much easier to realise if the same authority manages both the identifier and the resource identified, because that authority can easily put automatic update procedures  in  place.

Of course, keeping the identifier up to date is not impossible if the identifier and the resource are managed by different parties: the ANDS Identify My Data product depends on that scenario. But this does require that the data manager independently ensures that the resource information associated with the identifier is kept up to date.

3.4.3 Timing of persistent identifiers

Because the identifier and the resource identified are discrete digital objects, management of the timing of the creation and public release of the two digital objects has to be coordinated. A digital object should not be published before its persistent identifier is published. Otherwise, the digital  object's  non-persistent  URL could end up cited instead of its persistent identifier: once third parties start using a particular identifier, it is difficult to achieve a switch to another identifier.

An effective operating practice is to associate the name with the thing in the identifier record, before either is published. You can publish the name before the thing is accessible, with the understanding that it will fail to retrieve the thing in the short term.

3.5 Identifier persistence

3.5.1 Planning for identifier persistence

To minimise disruption of persistent identifiers as much as possible, a persistent identifier should be coupled only loosely to the current technologies used in data management.

3.5.1.1 How to use URIs

Care is needed when using current network address URLs to identify objects. A URL is tightly coupled to the current technology used to deliver the object, and becomes obsolete if the object is migrated to another server. This does not mean that HTTP URIs should be avoided; rather it means that care should be taken when using HTTP URIs. The file location-specific URL (e.g. 'http://.../~jsmith/pubs/paper-jan09.pdf') should be avoided as much as possible, in favour of an abstract URI such as 'http://.../items/a42p5'.

Several identifier schemes rely on a two-tiered system of indirection to achieve persistence: the identifier is resolved to the resource's current URL (as a 'locator'), which is then used to retrieve the resource. This is conceptually possible because the current URL of a resource is distinguishing metadata about the resource, so an identifier can resolve to a URL. Under this arrangement, the current URL may change as the object moves, but the identifier itself does not have to, so long as the URL resolution is kept up to date. Users can keep using the original identifier to access the resource, and the identifier is managed to keep its resolution up to date.

The two-tiered system introduces the possibility of using identifiers that are not URLs. A URL can be persistent by having it aliased to whatever the current URL of the resource is, and updating that alias; that is the model that HTTP redirection is based on. But a resolver — a service performing  identifier  resolution  — can also take an identifier which is not a URL, and map it to the current URL of the resource. So long as the resolver is in a known place, the identifier itself does not need to be a URL. In fact, some communities prefer identifiers which are not URLs, to avoid confusion  between  the  persistent identifier and the current URL.
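The two-tiered arrangement can be sketched as a small in-memory resolver in Python. The class, the label and the URLs are all hypothetical; a real resolver such as a Handle server is a networked service, not a dictionary.

```python
class Resolver:
    """Map persistent identifiers to the current URL of each resource."""

    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def register(self, identifier: str, url: str) -> None:
        """Create an identifier; an existing one is never re-created."""
        if identifier in self._locations:
            raise ValueError(f"{identifier} is already registered")
        self._locations[identifier] = url

    def update(self, identifier: str, new_url: str) -> None:
        """Point an existing identifier at the resource's new location."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier: str) -> str:
        """First tier: identifier to current URL (the 'locator')."""
        return self._locations[identifier]


resolver = Resolver()
resolver.register("a42p5", "http://lab.example.org/data/run1.csv")
# The resource moves; only the resolution changes, not the identifier.
resolver.update("a42p5", "http://repository.example.org/items/run1.csv")
print(resolver.resolve("a42p5"))
```

The identifier 'a42p5' is not a URL at all here; it only needs a resolver in a known place to be turned into one.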

This defines how technology can keep an identifier persistent; but it only tells part of the story. The real challenge is in the policy and the expectation of trust that surrounds a persistent identifier.

This approach also includes processes dealing with the object internally. The more a non-persistent identifier is used, the higher the risk that it will leak into the public domain (where it may be harmful), and the more dependency is introduced on a transitory identifier.

ANDS recommends using a two-tiered resolution arrangement (accessing the actual network address of a resource indirectly), in order to avoid excessive coupling of processes to the current file location. However, any processes that will affect the way the persistent identifier resolves need to be tightly  integrated  with  processes updating the identifier resolution.

For example, the process for moving a resource to a new directory or server should be engineered to simultaneously update the identifier record. This eventually should lead to the identifiers being an added layer of information infrastructure, leveraged to provide a more generic, less technology-dependent  way  of  managing data.

3.5.1.2 Updating a URI

When moving identified items, identifier updates should happen without significantly disrupting user access through the identifier:

  • Copy the item to the new location.
  • Call PIDS to update the identifier record.
  • Check that the record has been updated successfully.
  • Remove the item from the old location.
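The four steps above can be sketched in Python as follows. The functions 'update_identifier' and 'resolve_identifier' are hypothetical stand-ins backed by an in-memory dictionary; they do not reproduce the interface of the ANDS web services.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical stand-in for the identifier service.
registry: dict[str, str] = {}


def update_identifier(identifier: str, location: str) -> None:
    """Stand-in for the call that updates the identifier record."""
    registry[identifier] = location


def resolve_identifier(identifier: str) -> str:
    """Stand-in for resolving the identifier to its current location."""
    return registry[identifier]


def move_item(identifier: str, old: Path, new: Path) -> None:
    shutil.copy2(old, new)                          # 1. copy the item
    update_identifier(identifier, str(new))         # 2. update the record
    if resolve_identifier(identifier) != str(new):  # 3. check the update
        raise RuntimeError("record not updated; old copy left in place")
    old.unlink()                                    # 4. remove the old copy


with tempfile.TemporaryDirectory() as tmp:
    old_path = Path(tmp) / "old.dat"
    new_path = Path(tmp) / "new.dat"
    old_path.write_text("data")
    update_identifier("102.100.100/15", str(old_path))
    move_item("102.100.100/15", old_path, new_path)
    print(resolve_identifier("102.100.100/15") == str(new_path))  # True
    print(old_path.exists(), new_path.exists())                   # False True
```

Removing the old copy only after the check succeeds means that the identifier always resolves to a location where the item exists.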

NOTE: The ANDS Identify My Data product includes a set of web services that can be used to integrate Identify My Data with other software. For members of the ANDS community using this option, the update process can be automated. When checking whether the record has been updated successfully, note that handle values are cached by the proxy resolver. This means that the 'auth' parameter should be used in resolving the updated record, to ensure that record values are fetched from the ANDS handle server and not from the proxy cache.

3.5.2 Persistence time span

No identifier will persist forever. However, identifier authorities can help identifier users plan for change usefully, by issuing an undertaking to support persistence for a fixed time period. That undertaking should be discoverable by identifier users.

Some communities may need the identifier to outlive the resource for different lengths of time, so that historical citations of the identifier are still usable, while others do not. The length of persistence also depends on the technical and governance constraints imposed by the identifier's own infrastructure.  Critically,  this  includes planning for the future management of the identifiers, when the current manager has moved on.

3.5.3 Persistence guidelines for change in names

In addition to other forms of persistence disruption discussed above, disruption can occur when a thing starts being identified by a different name, and the original name is no longer associated with it. Usually, the mere existence of two names is inconvenient or confusing, rather than disruptive:  it  disappoints  the presumption of a Universal Identifier (that you need only search for one name to gather all mentions of the thing identified), but that in itself does not disrupt the use of an identifier to identify things. However, if the old name ceases to function, disruption occurs.

The old name may be discontinued because:

  • The name is meaningful, and its meaning no longer reflects the current state of the identified object. This is an argument for avoiding meaningful labels in persistent identifiers.
  • The name no longer complies with the name policy of the identifier authority; e.g. it is the wrong length, contains the wrong characters, or uses inappropriate branding.
  • The identifier system maintaining the old name has been decommissioned, and cannot redirect queries about the old name to a new identifier system.
  • Control of the identifier has passed to a new authority, and the old authority can no longer maintain the identifier.

Eliminating the old name means it is no longer a persistent identifier, which compromises users who relied on it persisting. To address this problem, consider:

  • Continuing to maintain the old name in parallel with the new name. Circumstances may often make this impractical and this imposes an increased maintenance burden. If persistence has been guaranteed only for a fixed time, it may be necessary to wait for the last allocated identifier to expire before decommissioning the system.
  • Keeping the old name accessible online, but only as an alias redirecting to the new name. This cannot be done if the identifier system must be taken offline; if it can stay online, it is a reasonable way of dealing with persistence, placing the onus for continuous maintenance on the new identifier manager.
  • Updating the old identifiers to information-only splash pages. The user sees a page explaining that the old identifier is deprecated and will be discontinued. The inconvenience of seeing this page may discourage them from continuing to use it.
  • Patching all mentions of the old name to the new name. This is a common strategy in renaming, e.g. with web links, 'please update your links to…'. The more widely the old name has been used, the more difficult it is to get compliance with such a request: one unpatched instance of the old name is enough to expose the failure of the old name to persist as an identifier.

Whenever a published URL ends up broken, it is very likely someone will be inconvenienced. Not all potential users can be notified, and not all references (such as paper ones!) can be updated.

Strategies to anticipate and mitigate this problem include:

  • Avoid meaningful labels.
  • Use loose identifiers instead of rigid identifiers, allowing flexibility in what is identified, and concentrating where possible on the role played by the thing identified, rather than its specific instance. For example, 'the current data manager' is likelier to persist as an identifier target than mentions of 'Mrs Jo Bloggs', who may eventually leave the institution and will likely not remain the data manager for life even if she stays. Similarly, identify resources at the level of a 'work', so they can continue to be used regardless of version, format, or location. (The rigidity of identifying resources by network location is why URLs have been so fragile as identifiers.)
  • Do not recycle names in a context. Even if an identifier has been deleted from an identifier system, citations of the old identifier may exist. Reusing an identifier to identify a new resource will lead to unnecessary confusion for users who happen to access the old citations.
  • Maintain labels between contexts. If an identifier moves from one system to another, aliasing or patching will be necessary. Both will be easier to realise if each name in the old system has the same label as the name in the new system. For example, if identifiers 'XYZ' and 'ABC' are migrated from 'hdl:1003.34' to 'http://purl.org/xyz', things will be made much simpler if 'hdl:1003.34/XYZ' migrates to 'http://purl.org/xyz/XYZ', and 'hdl:1003.34/ABC' to 'http://purl.org/xyz/ABC'. The process of patching or aliasing can then easily be automated.
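
The label-preserving migration in the last point can be sketched in a few lines of Python. This is only an illustration: the function name `migrate_name` and the `alias_table` structure are assumptions made for the example, not part of any ANDS or Handle interface. It shows why keeping the label constant lets the alias or patch table be generated mechanically.

```python
def migrate_name(old_name: str, old_context: str, new_context: str) -> str:
    """Move a name to a new context while keeping its label unchanged.

    Hypothetical helper: assumes names are written as '<context>/<label>'.
    """
    prefix = old_context + "/"
    if not old_name.startswith(prefix):
        raise ValueError(f"{old_name!r} is not in context {old_context!r}")
    label = old_name[len(prefix):]
    return f"{new_context}/{label}"


OLD_CONTEXT = "hdl:1003.34"
NEW_CONTEXT = "http://purl.org/xyz"

# Build the old-name -> new-name table used for aliasing or patching.
alias_table = {
    old: migrate_name(old, OLD_CONTEXT, NEW_CONTEXT)
    for old in ("hdl:1003.34/XYZ", "hdl:1003.34/ABC")
}
```

If labels had been changed during the move, this table would have to be compiled and checked by hand instead.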

NOTE: The ANDS Identify My Data product does not recycle or duplicate identifiers. ANDS only mints an ANDS PID once (although the PID can be reassigned to another authority and related descriptive data updated).

3.5.4 Persistence guidelines for change in resolution

Persistence can also be disrupted if a name in use starts to identify a different thing. This is an even more pernicious failure: users will not see broken links, and will reasonably assume everything is still working. However, what they are resolving to is different, so the expectation of persistence of association (the identifier keeps identifying the same thing) has been compromised.

This kind of error can occur because the update procedures for either the identifier or the thing identified have failed, or because of human error. It may also happen because the information model for the thing identified has been misunderstood, so it has not been updated correctly. For example, someone could update a file resolved to by an identifier, when the identifier was intended to reference a specific, frozen version.

To avoid erroneous updates of identifiers, the following measures can be taken:

  • Use identifier typing, ensuring that at least the same type of resource is identified whenever the resolution is updated. For instance, identifier metadata could include the MIME type of the digital object being identified; an update to the identifier URL can then be validated against the registered MIME type, to ensure the new URL points to an object with the same MIME type.
  • Periodically validate identifiers, to check that an identifier is continuing to resolve (checking for 'link rot'), and use digital fingerprinting (e.g. checksums) to confirm that the object retrieved at the URL has not changed.
  • Use multiple kinds of resolution data, to provide redundancy in case of something going wrong. For example, if the URLs that identifiers resolve to have failed catastrophically (so that not even the directory structure on the data server can be recovered), descriptive text fields on those same identifiers can still help piece together which object was pointed to by which identifier.
  • Embed the identifier into the digital object, such as by citing the identifier in the content header, or through digital steganography. This way, the retrieved copy of the digital object can be used to confirm that it is what was intended.
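
As a minimal sketch of the first two measures, the following Python fragment checks a proposed update against a registered MIME type and compares a retrieved object with a stored checksum. The record layout (`IdentifierRecord`) and the function names are assumptions made for illustration, and SHA-256 is just one possible fingerprint; a real identifier management system would store and validate this metadata in its own way.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class IdentifierRecord:
    """Hypothetical identifier metadata: name, locator, type and fingerprint."""
    name: str
    url: str
    mime_type: str
    sha256: str


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 checksum of the object's content."""
    return hashlib.sha256(content).hexdigest()


def validate_update(record: IdentifierRecord, new_mime_type: str) -> bool:
    """Identifier typing: accept a new URL only if the object type is unchanged."""
    return record.mime_type == new_mime_type


def content_unchanged(record: IdentifierRecord, retrieved: bytes) -> bool:
    """Fingerprinting: confirm the retrieved object matches the registered one."""
    return fingerprint(retrieved) == record.sha256


record = IdentifierRecord(
    name="hdl:123/456",
    url="http://example.org/paper.pdf",
    mime_type="application/pdf",
    sha256=fingerprint(b"frozen release 1.0"),
)
```

A checksum comparison of this kind only makes sense for rigid identifiers; for a loose identifier the content is expected to change.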

3.5.5 Persistence under changed management

Using an identifier update service presupposes that the data manager remains in contact with the identifier manager, and stays authorised to update identifier data promptly. However, any plan for identifier persistence must include a plan for the contingency when that arrangement is disrupted, and the data management system can no longer communicate updates to the distinct identifier management system, even though the thing identified is still online.

3.5.5.1 Update disruption

For an identifier update to work, the update information must reach the identifier system, and the updated identifier must be returned to data management. The process of identifier updating can be disrupted or discontinued for several reasons:

  • By accident, e.g. the identifier system is down. It is important that software used to do updates can cope with temporary outages.
  • Either the identifier or the data is no longer actively maintained, at least in its current form, and may even be deleted.
  • Either the data management system or the identifier management system is upgraded, and has changed interfaces which are now incompatible with the other system.
  • The data have moved to new management, who have not established an update relation with the identifier system — or cannot establish such a relation, for technical or policy reasons.
  • The identifier system has moved to new management, and has not established (or cannot establish) an update relation with the data management system. For example, if University A takes over responsibility for Handle server 123.456 from University B, researchers would not immediately be able to issue updates. In this case the identifier system has been moved to a different manager ('re-homed'), but it has not changed name, so end user access to its identifiers has not been affected by the move.
  • The identifier has moved to a new identifier context, i.e. a new server, and possibly a new identifier scheme. This means the identifier has been replaced with a new identifier. This disrupts not only the data manager, but also the end user: the end user has to start using a new identifier, because the old identifier name has failed to persist. (See §3.5.3 'Persistence guidelines for change in names'.)

To deal with these contingencies, consider the following procedures:

  • Ensure that both data and identifier services are reliable, through routine safeguards such as mirroring, backups, and service level agreements.
  • Ensure data and identifier services are delivered without interruption during upgrades, and maintain backward compatibility as much as possible.
  • If data are no longer maintained, provide an indication of this in the identifier record.
  • Make explicit arrangements to establish a new trust relation between identifier and data managers, if either changes. One way of doing this is to give the new data manager write privileges for the identifier in the old identifier system. Such cross-institutional ad hoc arrangements may be problematic. Establishing a new trust relation is easier if identifiers are run by a third party provider, such as ANDS.
  • Delegate identifier management to a successor. Identifier persistence planning must include some kind of successor planning. 'Hosts of last resort' could take over persistent identifiers abandoned by their projects, and maintain them archivally. For example, the CLARIN project[24] replicates identifiers across many providers to avoid identifiers being orphaned through lack of a succession plan.
  • Alias an old identifier to a new identifier, in case the old identifier can no longer be actively maintained but can still be kept online. This applies equally if an identifier is migrated to a new identifier scheme (see §3.5.3 'Persistence guidelines for change in names'), although that will lead to problems in the interoperability of identifier services, or inconsistent functionality. No clear best practice has emerged yet on how best to migrate identifiers to different schemes.
  • As a much less preferable solution, 'patch' all accessible instances of the old identifier, replacing them with the new identifier. This is common, but cannot possibly cover all instances of a cited identifier once it is published. It is an essential mitigation step if the original identifier management system is decommissioned.
  • Archive the old identifier: flag that it will no longer be actively maintained but still allow it to resolve to the last available information on the thing identified. Flagging can be limited to an informative metadata field (which end users would normally not see), or an explicit redirection to a splash page warning users proceeding with resolution (which is increasingly seen online, but can disrupt machine-to-machine resolution). Out-of-date information on the resource is more useful than no information at all, and can help users locate the resource on their own, whether online or not. Descriptive metadata in the identifier still preserves the association of name to thing, even if the thing can no longer be retrieved through the identifier.
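
The archiving option in the last point can be sketched as follows. The `Record` and `Resolution` types and the `archived` flag are hypothetical names chosen for this example; it takes the "informative metadata field" route, so that machine-to-machine resolution still receives the last known locator and description rather than a splash page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical identifier record with an archive flag."""
    name: str
    url: Optional[str]        # last known locator, if any
    description: str          # descriptive metadata about the thing identified
    archived: bool = False


@dataclass
class Resolution:
    url: Optional[str]
    description: str
    warning: Optional[str]    # set when the identifier is no longer maintained


def resolve(record: Record) -> Resolution:
    """Resolve to the last available information, flagging archived identifiers."""
    warning = None
    if record.archived:
        warning = f"{record.name} is archived and no longer actively maintained"
    return Resolution(record.url, record.description, warning)


old = Record(
    name="hdl:123/456",
    url=None,  # the object can no longer be retrieved online
    description="Survey dataset, 2009 release, held on tape",
    archived=True,
)
result = resolve(old)
```

Even with no locator left, the description returned here still tells the user what the name identified.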

4. Glossary

Abstract

See concrete.

Accountable

An object is accountable if information is up to date and available to outside parties regarding who has managed the object and how, both in the past (provenance), and currently. This information is called authority metadata. If the object is accountable, parties can determine who has been responsible for it, and what changes have been made to it, and can follow up any queries with them.

Aggregate Object

A digital object is an aggregate if it is composed of other digital objects. For example, an online course could be an aggregate object, composed of other course modules. Aggregate objects require a policy decision of identifier managers, on whether identifiers should be published for each component and the aggregate object.

Alias

An identifier A is an alias of a target identifier B if they both identify the same thing, and any changes to B are automatically reflected in A (but not vice versa).
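
One way to obtain the one-directional behaviour described here is for the alias to store a pointer to its target rather than a copy of the target's association data. The sketch below assumes a toy in-memory `registry` dictionary and an `alias_of` field; both are illustrative names, not features of any particular identifier system.

```python
# Toy registry: B holds association data, A only points at B.
registry = {
    "B": {"url": "http://example.org/data/v1"},
    "A": {"alias_of": "B"},
}


def resolve(name: str, registry: dict, max_hops: int = 10) -> str:
    """Follow alias pointers until a record with its own association data is found."""
    for _ in range(max_hops):
        record = registry[name]
        if "alias_of" not in record:
            return record["url"]
        name = record["alias_of"]
    raise RuntimeError("alias chain too long or circular")


before = resolve("A", registry)

# Updating the target B is automatically reflected when resolving the alias A.
registry["B"]["url"] = "http://example.org/data/v2"
after = resolve("A", registry)
```

The hop limit guards against circular aliases, which a real system would more likely reject at registration time.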

Allocate

To transfer control of something to an authority.

Appropriate Copy

An Appropriate Copy service is a resolution service which selects one of several instances of a resource, based on context such as user parameters. If an identifier has multiple resolution, an appropriate copy service can resolve to the most appropriate instance of the thing identified, given the instances nominated by each possible resolution.
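
A minimal sketch of such a selection step is shown below, assuming each instance returned by multiple resolution carries a `region` attribute and the user context supplies a preferred region. The attribute, the URLs, and the fallback rule are all assumptions made for illustration.

```python
# Hypothetical resolution data: several instances of the same resource.
instances = [
    {"url": "http://mirror-au.example.org/paper.pdf", "region": "AU"},
    {"url": "http://mirror-eu.example.org/paper.pdf", "region": "EU"},
]


def appropriate_copy(instances: list, user_region: str) -> str:
    """Pick the instance matching the user's region, else fall back to the first."""
    for candidate in instances:
        if candidate["region"] == user_region:
            return candidate["url"]
    return instances[0]["url"]


chosen = appropriate_copy(instances, "EU")
```

Other selection criteria (licence, file format, institutional subscription) would follow the same pattern with a different matching rule.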

Arbitrary

An identifier is arbitrary if there is no direct relationship between the name and any relevant facts about the thing identified. An identifier is arbitrary by policy, if there is a policy to ignore any relationship between the name and the thing identified in using the identifier.

Archival Lifespan

The archival lifespan of a digital object is the time during which either the object or information about the object is maintained, regardless of whether the object is actively used. For example, once a digital object is archived to tape, its active lifespan may be over, but its archival lifespan continues until the tapes are destroyed. An identifier can continue to point to a thing throughout its archival lifespan: even if the thing is no longer online, the identifier can provide access to information on how to retrieve the digital object from storage.

Archival lifespan varies by domain; digital archiving quotes figures between 15 and 100 years.

Associate

See Identify.

Association data

Association data associates a name to the thing identified in an identifier, in a form that can be registered on an identifier management system. For digital objects, association data is often a retrieval key for the object from a data source (e.g. a locator). More generally, resolving an identifier returns some form of the association data for the identifier that can be used to determine the thing being identified, by distinguishing it from other things. That means it can apply to things that are not retrievable online; e.g. an offline resource, or an abstract concept.

Association policy

An identifier authority's association policy defines the range of things that can be assigned an identifier.

Authority

An authority for an object is responsible for maintaining it. This includes ensuring that it has accurate content. For example, identifiers, policies, and identifier management systems all have an authority maintaining them.

Authority Metadata

See Accountable.

Canonical Identifier

See Preferred Identifier.

Cite

An entity is cited if its representation is communicated to an audience through some medium. The entity is citable if it can be cited. For example, citing the identifier {('Handle server 102.100.100', 'XYZ'), 'ANDS policy on citation'} may require choosing an appropriate string representation of the identifier (e.g. 'hdl:102.100.100/XYZ'), and embedding that representation in a PDF.

An entity is web-citable if the representation can be embedded in a document on the web, print-citable if it can be embedded in a print document, and speech-citable if it can be embedded in spoken text (e.g. read out over the phone). An identifier should be citable in non-digital environments.

Compound Digital Object

See Aggregate Object.

Concrete

A concrete entity is a digital object, such as a PDF version of a published paper, directly managed in an electronic system. By contrast, an abstract entity, such as the paper itself, cannot be directly stored, but can still be identified and have metadata about it published. An abstract entity can be realised by multiple concrete entities, in different formats, versions or languages. This realisation process is frequently modelled explicitly through relationship data stored in the system.

Context

A context differentiates labels used for distinct purposes and with different authorities. The combination of a label and a context gives a name; any label is necessarily unique in its context, though the same label may occur in different contexts. Contexts impose policies on the labels in the context, including association policy, determining how the name is interpreted as an identifier. Contexts can also impose policies on what labels are allowed in the context (label format policy), and who can perform what actions on a name in that context (access policy).

Contexts are sometimes themselves identified by 'context identifiers'.

Core Service

A core service in the context of persistent identifiers is a service targeted directly at the maintenance of persistent identifiers, such as Register, Update, Delete, and Resolve. The ANDS PID services are core services. They are distinguished from value-added services, such as checks for link rot, checksum validation, and intelligent resolution.

Create

Creating a thing is distinguished from registering it. For instance, geographical coordinates can be created for use as an identifier, without those coordinates being registered explicitly in an identifier management system.

Curation

Curation describes a range of activities and processes which create, manage, maintain, and validate an object. Curation is undertaken by the managers of an object, not by end-users of the object. Curation of an object begins in the lead up to its publication; if the object is already published, curation can lead to a new event of publishing, such as a new release.

Curation boundary

The curation boundary is a conceptual boundary defined by access to actions for curation, and is used to model the concept of publishing. A digital object within the curation boundary is only accessed through curation actions by authorised parties (administrators); the object is not accessible to end users. Curation of the object is undertaken with the aim of improving it to the point where it is ready for publishing. While inside the boundary, the object is not yet considered published, even if there are multiple authorised curators.

Once ready, the object is published, by moving it outside the curation boundary. It is now accessible via a wider range of means, by end-users. The end-users' access is by definition read-only.

The curation boundary can be crossed without necessarily providing public access to the digital object: end users may still need authentication to read the object. Also, if metadata about the object is released outside the curation boundary, it is considered published, regardless of whether direct access to the object itself (retrieval) is provided.

Data source

A data source is a tool for the storage and management of data by a party. The data source exposes services which allow access to that data by other parties.

Digital Object

A digital object is any thing transmittable through electronic networks and storable on a data source.

Enclosing Context

See Subcontext.

Encoding Scheme

A mapping of labels to labels, preferably one-to-one. An encoding scheme may be appropriate to representing labels in a particular medium.

Equivalent

Two identifiers are equivalent if they both identify the same thing. For example, 'hdl:123/456' and 'http://purl.org/poi/foo.org/paper-100' are equivalent if they both identify the same paper.

Expression

See Version.

FRBR

The Functional Requirements for Bibliographic Records[25] (FRBR) is a relational information model for bibliographic objects, which has also been applied to digital objects. The FRBR model includes four levels of representation: work, expression, manifestation, and item.

  • the work, a distinct intellectual or artistic creation
  • the expression, the intellectual or artistic realization of a work
  • the manifestation, the physical embodiment of an expression of a work
  • the item, a single exemplar of a manifestation.

Global

In the context of digital objects, an object is global if it is accessible through a readily available protocol such as HTTP, outside a private or proprietary domain. An object can be used globally but still be subject to authorisation.

Handle system

The Handle system[26] is one of a number of identifier management systems available for online identifiers. It is a robust and flexible system, which allows different kinds of resolution and metadata to be associated with identifiers. The ANDS PIDS is based on Handle.

Hosting

An institution hosts an identifier management system if the institution (as a corporate entity) is the authority for the system. This requires the institution to provide infrastructure to make the identifier management system functional and trustworthy. An institution may instead choose external hosting for the identifier management system. In that case the institution retains responsibility for managing the digital objects in the management system, but delegates the infrastructure for the system (i.e. hosting the system) to another party.

Identifier

An identifier is the association of a name with a thing. A name may only be associated with one thing at any time, and the name is said to identify the thing.

Identifier Management System

An identifier management system is a collection of definitions, information models, policies, and data sources, used to manage identifiers. The minimum requirement is a data source to store registered identifiers as digital objects; the data source is accessed by both administrators and end users through core services. An identifier management system has a curation boundary associated with it, so it distinguishes between administrators and end users; and it has an authority acting as its owner, who is responsible for the maintenance and persistence of the identifiers.

The identifier management system can be realised in a rather simple way: an Apache Redirect list is a kind of identifier management system for HTTP URI identifiers. In fact so is the File Allocation Table of the directory Apache exposes to the Web: it maps file names (which are identifiers) to disk blocks (which are things identified).

On the other hand, an identifier management system can be a full-fledged registry, with rich metadata about identifiers, and built-in procedures to enforce identifier policy. The Handle technology used by ANDS PIDS is closer to the full-fledged type.

Identifier Management System Context

Deploying an identifier management system defines an identifier management system context: this is a single concrete context specific to that deployment. The purposes, labels, policies and authorities of the identifier management system context are defined with reference to the system itself.

Identify

A thing is identified by a name if the name and the thing together form an identifier. The name is associated through the identifier with the thing, and this association is recorded in an identifier management system as association data.

Information Model

An information model is a model of things in a domain, their properties, and the relations between them. An information model informs the choice of what things to identify in an identifier management system.

Label

In online identifier management systems, a label is a string that can potentially be used as a name.

Locator

A locator is a string giving the location of a digital object in a data source, and can be used as a retrieval key to gain access to the object. A URL is an example of a locator, although not all HTTP URIs are locators. A locator is specific to a data source, and cannot be used to access an instance of the digital object in a different data source. A locator can be used as an identifier; but it will usually not be persistent. Persistent identifiers often resolve to the current locator(s) of the thing identified. This uncouples the persistent identification of a resource from the current retrieval key for the resource.

Loose

An identifier is loose if the thing it identifies can change over time. For instance, the identifier 'latest version of the ANDS citation policy' points to a single well-defined thing, but the content of that thing will change as new versions of the policy are released. What stays constant for a loose identifier is the role that the thing identified fulfills (e.g. 'latest', 'local'), rather than its content. Identifiers which are not loose are rigid.

Manifestation

See Presentation.

Meaningful

An identifier is meaningful if there is a direct relationship between the name and a relevant fact about the thing identified. Non-meaningful identifiers are called 'arbitrary'.

Medium

A vehicle through which a message is transmitted from a sender to a receiver.

Mint

The entire process of creating, registering, and publishing an identifier through an identifier management system.

Multiple Resolution

An identifier has multiple resolution if the association data returned for a request for resolution can be used to access the thing identified from more than one location. In typical identifier management systems, this means that the identifier can be resolved to more than one URL. (Strictly speaking this is the more narrow notion of 'multiple retrieval', but because of the longstanding conflation of resolution and retrieval, this term has stuck.)

Multiple resolution is only possible if the thing identified is not a specific instance of a digital object at a given location, but an abstraction (e.g. 'any digital object with this content'). Any of the locators returned by the identifier allows a valid resolution of the identifier, and allows the user the choice of which locator to access.

Name

A name is the association of a label with a context. For example, '12' is a label, but is considered a different name in the context of Arabic Decimal numerals, movie titles, or Apollo moon missions. The label must comply with any policy requirements that the context makes for the association to be valid. The same label paired with a different context gives a different name.
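
The pairing of label and context can be modelled directly as a two-part key. The sketch below is illustrative only: `Name` and `Registry` are names invented for the example, and the registry simply refuses a second registration of the same name, which is one way of expressing that a label is unique within its context while remaining free for reuse in other contexts.

```python
from typing import NamedTuple


class Name(NamedTuple):
    """A name is a label paired with the context that gives it meaning."""
    context: str
    label: str


class Registry:
    """Toy registry associating names with things (i.e. storing identifiers)."""

    def __init__(self) -> None:
        self._things = {}

    def register(self, name: Name, thing: str) -> None:
        # A label may occur only once in any one context.
        if name in self._things:
            raise ValueError(f"{name.label!r} already used in {name.context!r}")
        self._things[name] = thing

    def identify(self, name: Name) -> str:
        return self._things[name]


registry = Registry()
# The same label '12' yields two different names in two contexts.
registry.register(Name("Apollo moon missions", "12"), "the Apollo 12 mission")
registry.register(Name("movie titles", "12"), "a movie titled '12'")
```

Attempting to register `Name("movie titles", "12")` a second time raises `ValueError`, mirroring the rule that a name identifies only one thing at any time.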

Namespace

A namespace is a concrete context for names (i.e. a context supported by a registry), used to disambiguate labels. It is typically included in the representation of names as a prefix. The namespace is typically the same as the identifier management system context.

Naming Authority

A naming authority is an authority over a system for managing names. An identifier management system manages names as part of managing identifiers; and an identifier management system defines its own concrete context for those names. Therefore 'naming authority' is often used loosely to refer to the context it manages, which is an identifier management system context.

Obfuscated

An identifier is obfuscated or opaque if it is meaningful, but the meaningfulness of the identifier cannot be inferred by inspection. The meaningfulness of the identifier can only be inferred if one is aware of the process through which the name has been generated. For example, the identifier 'hdl:123/a85b3' could contain a datestamp for the creation of the handle, if one knows the formula.

Opaque Identifier

An identifier with no obvious embedded meaning.

Party

A party is a person or a group which can act as an authority over an object, or which can participate in various processes, including managing or using identifiers.

Patch

An identifier is patched if its name is changed, for whatever reason. The old name in the identifier is deprecated in favour of the new. Patching means that all cited instances of the identifier must be updated to the new name to maintain identifier functionality: this is increasingly unrealistic the more widely the identifier has been published and used.

Persistent

An object is persistent if it is managed and maintained for a defined time span. Maintaining the object includes ensuring that its published content (such as its association data) is valid at all times. The time span for persistence need not be indefinite. Persistence can apply to other qualities; e.g. persistence of actionability, of accountability, of association (i.e. the association between name and thing identified in an identifier), of functionality (i.e. the type of thing being identified remaining the same). Normally when an identifier is called persistent, persistence of association is meant: the association between the name and the thing in the identifier.

Policy

A policy is a set of rules set by an authority, with a particular scope over which the rules can reasonably be enforced (a policy domain); for example, an institution's policy is typically restricted to the members and property of the institution. Policies may be enforced in an identifier management system for the things it maintains: this means that the system ensures that the rules defined in the policy are true of the entities, actions and qualities maintained through the system.

Preferred Identifier

An identifier is preferred (or canonical) according to an authority, if that authority guarantees its persistence over other equivalent identifiers. The authority recommends that the preferred identifier should be cited, rather than any other equivalents. For example, a digital object in a repository may be identified by a title, a URL locator, and a PURL. The repository manager guarantees persistence only for the PURL. Therefore the PURL is the preferred identifier for the object, according to the repository manager.

Presentation

A presentation of a digital object is an abstraction fixing both the content and the appearance of the digital object. Two instances belong to the same presentation if they have the same content and the same appearance; they belong to different presentations if they have different appearances, even if they have the same content. Presentations may differ by file format, schema, formatting, branding, and so forth. Manifestations in the FRBR model are a type of presentation.

Provenance

Provenance is the history of how a thing has been managed over time. Data documenting the provenance of a digital object are part of the authority metadata for that object, and can be used to establish accountability for any changes in the object.

Publish

An object is published when it can be accessed by at least one non-curatorial (read-only) action for a given user profile. Publishing an object is conceptually passing it across the curation boundary. The activity of publishing a digital object is distinct from the activity of publishing an identifier for the object: an object is typically published through a resolvable identifier, but the two publishing events need not coincide.

Realises

A concrete entity realises an abstract entity if the two entities are equivalent in some way, and the concrete entity can be used to fulfil requirements made of the abstract entity. For example, an abstract identifier identifies a thing and is citable; but it is not resolvable or accountable. A concrete identifier synonymous with the abstract identifier identifies the same thing (because it is synonymous), and is citable; so it fulfils the same requirements. But it is also resolvable and accountable, so it is used in actual systems instead of the abstract identifier.

Register

A party registers an object if they cause it to be registered in a system, and the object is not already registered there. Registering is a curatorial action.

Registered

An object is registered if it is maintained (i.e. stored or represented) and managed in a system.

Re-homing

An identifier management system and its components are re-homed if the authority over the system (and potentially its physical location) is transferred from one party to another. A re-homing plan is necessary to ensure persistence of identifiers past any changes in identifier management. As long as an identifier system is not dependent on locators as identifiers or identifier resolvers, re-homing does not affect user interactions with identifiers. E.g. if a Handle server is re-homed from Melbourne to Monash, the context identifier will not change, so the handles transferred need not change. However if a URL server is re-homed from Melbourne to Monash, the context identifier (domain name) will typically need to change.

The ability to cope with re-homing is a difference between DNS-based and non-DNS-based identifier schemes. Resolver services still have Web-Resolvable locators, but non-DNS schemes typically have centralised resolver services to address this problem. For instance, calls to a Handle resolver service at http://handle.unimelb.edu.au/ will no longer persist if the resolver service is re-homed to Monash, but using a centralised resolver service like http://hdl.resolver.net.au/ mitigates this risk.

Reliable

An identifier is reliable if it remains resolvable without interruption for the duration specified in a service level agreement. This requires adequate IT infrastructure to be provided, including support for access, performance, and backups. Reliability is an important component of trustworthiness.

Most actions on identifiers involve both a resolver service, provided by the identifier management system, and an external service using the resolution data; e.g. a content delivery system. The identifier should be considered reliable so long as the resolver service is reliable, even if the external service is not. For example, if the identifier resolves correctly to a digital object on a repository, but the repository is down, the failure in reliability is the repository's, and not the identifier management system's. It is the data manager's responsibility to mitigate that risk, e.g. by mirroring the object and allowing resolution to mirrored copies.

Representation

An entity can have one or more representations, which can be communicated to an audience through some encoding scheme. Communicating a representation of an entity is citing the entity. The representation of an entity is a single symbol, whatever the internal structure of the element.

For example, we have defined an identifier as an association of a label, a context, and a thing identified; but the representation of an identifier is a single symbol, presenting the name through a combination of encodings of the label and the context (the latter often optional). So {('National Library Names', 'XYZ'), 'ANDS citation policy'} can have the single representation 'hdl:102.100.272/XYZ'.

Reserved

An element is reserved in an identifier management system if it has been marked as having a 'temporary' or 'in use' status. A reserved element cannot yet be published. Typically a reserved element may not undergo curatorial actions either until it is allocated to an identifier manager; for example, a set of labels may be reserved for use in identifiers, but are not actually used in identifiers until they are allocated to some identifier manager.

Resolve

Resolving an identifier is providing information about the thing it identifies. The data returned must be consumable by other processes, for instance a Retrieve process. An identifier is Internet-Resolvable if information on the thing identified can be requested and consumed through a well-defined Internet application protocol, and Web-Resolvable if that protocol is a defined web application layer protocol. An example of the latter is a Web service query on a Handle, with its request in HTTP GET and returning a URL.

Resolution in general operates on association data, registered with the identifier in an identifier management system. Resolution is a non-curatorial action.
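
As an illustration of web resolution, the sketch below builds the HTTP GET request URL for a Handle and extracts the URL values from the association data returned. It assumes the JSON layout of the public Handle.Net proxy REST interface (a `values` list whose entries carry a `type` and a `data.value`); the ANDS services may expose resolution differently, and the function names are hypothetical. No network call is made here, so the response is passed in as text.

```python
import json
from typing import List
from urllib.parse import quote

# Assumption: the public Handle.Net proxy REST endpoint.
PROXY_API = "https://hdl.handle.net/api/handles/"


def resolution_request_url(handle: str) -> str:
    """Build the URL to request with HTTP GET for a handle such as '102.100.272/XYZ'."""
    return PROXY_API + quote(handle, safe="/")


def extract_urls(response_text: str) -> List[str]:
    """Return the URL values found in the association data of a JSON response."""
    document = json.loads(response_text)
    urls = []
    for value in document.get("values", []):
        data = value.get("data")
        if value.get("type") == "URL" and isinstance(data, dict):
            urls.append(data.get("value"))
    return urls
```

A Retrieve process would then request one of the returned URLs to obtain a representation of the thing identified.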

Resolver

A resolver is a system which provides resolution of identifiers; that is, the resolver resolves identifiers, returning information based on association data. Resolvers are typically implemented in identifier management systems.

Responsible

A party is responsible for a thing if they are committed to ensuring the maintenance and accuracy of the thing. This makes that party the authority for the thing.

Retrieve

To gain access to a representation of a thing through its identifier. Loosely speaking, retrieving a thing on the web corresponds to downloading it. Retrieve presupposes resolve: it acts on the association data obtained for the identifier.

Rigid

An identifier is rigid if it identifies exactly the same thing at all times, with the same content (where applicable). For example, an identifier for Release 1.0 of a document is rigid, so long as that release is frozen, and its content does not change. Identifiers which are not rigid are loose.
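
One way to check that the content behind a rigid identifier has not changed is to record a content fingerprint when the identifier is created and compare it later. The sketch below uses SHA-256 for this; the choice of hash and the function names are assumptions for illustration, not a mechanism described by ANDS.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of the content."""
    return hashlib.sha256(content).hexdigest()


def is_unchanged(content: bytes, recorded_fingerprint: str) -> bool:
    """True if the content still matches the fingerprint recorded earlier."""
    return fingerprint(content) == recorded_fingerprint


# Usage: record the fingerprint when Release 1.0 is frozen, check it later.
release_1_0 = b"Release 1.0 of the document"
recorded = fingerprint(release_1_0)
still_rigid = is_unchanged(release_1_0, recorded)
edited = is_unchanged(b"Release 1.0 of the document, edited", recorded)
```

A mismatch indicates that the identifier is behaving as a loose identifier, whatever the policy says.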

Service

A service is an action operating on an object through some defined protocol for requests and responses, and hosted by a computer system.

Shared management

A party has shared management of an object if they are not the only party authorised to manage it. Shared management requires infrastructure to coordinate between the different managers of the object, to prevent inconsistency.

Subcontext

A context (enclosing context) contains another context (subcontext), if all labels in the enclosing context are also in the subcontext, and all policies enforced by the subcontext are also enforced by the enclosing context.

Synonymous

Two identifiers are synonymous if an authority claims that they are equivalent.

Transparent

An identifier is semantically transparent if it is meaningful, and the meaningfulness of the identifier can be inferred by inspection. For example, 'urn:etexts.charles-darwin/on-the-origin-of-species'.

Trust Boundary

A trust boundary delimits a set of parties, services, data sources and systems which may be involved in processes without need of authorisation. If a party, service, or system is outside another party, service, or system's trust boundaries, then authorisation is required in order for that party to participate in the activities of the system.

Trustworthy

An object is trustworthy if an end user can be confident that their use of the object will meet certain expectations. Those expectations typically include reliability, accuracy, and accountability.

Typed

An identifier is typed if the kind of thing it may identify is fixed by an authority, through a policy implemented in an identifier management system. Typing an identifier ensures persistence of functionality (the identifier will interact in similar ways with actions whatever it currently identifies), and is a weak mechanism for ensuring persistence of association.

Unique

An object is unique if there exists one and only one instance of the object within a given scope. For example, if a label is associated with a context in a name, the label must be unique in the context. If a name identifies a thing in an identifier, the thing must be unique in the identifier.
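
The uniqueness of a label within its context can be enforced at registration time. The sketch below is a hypothetical in-memory model (the `Context` class and its methods are not part of any identifier management system): the same label may be registered in two different contexts, but only once in each.

```python
class Context:
    """A naming context in which each label may occur only once."""

    def __init__(self, name: str):
        self.name = name
        self._things = {}  # label -> thing identified

    def register(self, label: str, thing: str) -> tuple:
        """Associate a label with a thing; reject a label already in use."""
        if label in self._things:
            raise ValueError(f"{label!r} is already used in context {self.name!r}")
        self._things[label] = thing
        # The (context, label) pair is the name.
        return (self.name, label)

    def lookup(self, label: str) -> str:
        return self._things[label]


# The same label is a distinct name in each context.
star_trek = Context("Star Trek")
stargate = Context("Stargate SG-1")
name_a = star_trek.register("The Enemy Within", "Star Trek episode")
name_b = stargate.register("The Enemy Within", "Stargate SG-1 episode")
```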

Universal

An identifier is universal if it is the only identifier in a context that identifies that thing. A universal identifier allows all actions operating on a thing to interoperate using the same identifier, and makes deduplication unnecessary. Universality is impossible over the context of all known naming systems: there cannot be only one identifier in the universe for something, because some authority can always come up with a new name for the thing. However, various strategies attempt to emulate something like universality, including preferred identifiers, and services mapping between synonymous identifiers.

Within the context of a single identifier management system instance, on the other hand, identifiers are often universal: such systems associate only one identifier with a thing, and do not maintain synonyms or aliases.

Value-Added Service

A value-added identifier service is enabled or enhanced through the use of persistent identifiers. It is distinct from a core service, which is targeted directly at the maintenance of persistent identifiers. Value-added services lie outside the domain of identifier management systems, and are not typically hosted by them; they instead consume the core services provided by identifier management systems.

Verify

A party verifies a quality for some entity in an identifier management system, if they confirm that the value of the quality reflects a true fact about the world. The usual target of verification is that an identifier will resolve to valid association data. Qualities may be verifiable (verification is feasible), and verified (verification has taken place successfully).

Version

A version of a digital object is an abstraction fixing the content but not the appearance of the digital object. Two instances belong to the same version if they have the same content; they belong to different versions if they have different content, but are still regarded as the same underlying thing. Versions may include revisions, transformations, translations, and so forth. Expressions in the FRBR model are a type of version.

Work

A work is an abstraction in the FRBR model representing a distinct result of intellectual endeavour. Works may have different expressions, manifestations and instances, which are nevertheless considered to represent the same intellectual endeavour.