Data de-identification, anonymisation and pseudonymisation are processes for removing identifying information from datasets, most commonly to protect the privacy of individuals. Data de-identification may also be used to protect organisations, such as businesses included in statistical surveys, or other sensitive information, such as the spatial location of mineral or archaeological findings or endangered species. Data de-identification may be mandated by legislation or ethical guidelines governing research.
The National Statement on Ethical Conduct in Human Research (2007, updated May 2015), published by the National Health and Medical Research Council, does not advocate use of the term de-identified data, but suggests the term 'non-identified' in preference.
This National Statement avoids the term 'de-identified data', as its meaning is unclear. While it is sometimes used to refer to a record that cannot be linked to an individual ('non-identifiable'), it is also used to refer to a record in which identifying information has been removed but the means still exist to re-identify the individual. When the term 'de-identified data' is used, researchers and those reviewing research need to establish precisely which of these possible meanings is intended.
Identifying information, such as identifiers, names, addresses, gender or date of birth, can be removed from datasets entirely, or coded or encrypted. Information can also be masked by changing data values or by aggregation.
Iain Hrynaszkiewicz, Melissa L Norton, Andrew J Vickers, Douglas G Altman, 'Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers', British Medical Journal, 29 January 2010. doi:10.1136/bmj.c181.
Techniques for de-identifying quantitative and qualitative data are quite different; the UK Data Service outlines approaches to both.
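The removal, coding and aggregation steps described above can be sketched for a single tabular record as follows. This is a minimal illustration, not a recommended procedure: the field names, the sample record and the SECRET_KEY value are hypothetical, and the use of a keyed hash (HMAC-SHA256) for coding identifiers is an assumption made for this example rather than a method prescribed by any of the guidance cited here.

```python
import hashlib
import hmac
from datetime import date

# Hypothetical key for this sketch; a real key would be kept secret
# and stored separately from the de-identified data.
SECRET_KEY = b"replace-with-a-secret-key"

# Hypothetical field names treated as direct identifiers and removed entirely.
DIRECT_IDENTIFIERS = {"name", "address"}


def pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """Code an identifier with a keyed hash, truncated to 16 hex characters."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def de_identify(record: dict) -> dict:
    """Return a copy of the record with identifiers removed, coded or aggregated."""
    # Remove direct identifiers entirely.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Code the participant identifier so the original value is not shown.
    out["participant_id"] = pseudonym(out["participant_id"])
    # Aggregate the exact date of birth to a year of birth.
    out["birth_year"] = date.fromisoformat(out.pop("date_of_birth")).year
    return out


record = {
    "participant_id": "P-0042",
    "name": "Jane Citizen",
    "address": "1 Example Street",
    "date_of_birth": "1986-03-14",
    "response": "agree",
}

clean = de_identify(record)
print(clean)
```

Because the same key always produces the same code for the same identifier, whoever holds the key can reproduce the coding. Records treated this way therefore fall under the second meaning of 'de-identified' discussed in the National Statement, where the means to re-identify still exist, rather than being non-identifiable.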
Implications for reuse
The purpose of de-identifying data is to allow it to be used by others without the possibility of individuals being identified. The loss of individual identities, however, means that it will not be possible to incorporate the data into other datasets which may include information about the same individuals.
For an overview of the potential for sharing data without linking it, see the Australian Bureau of Statistics: A good practice guide to sharing your data with others.
When to de-identify data
The need for data de-identification arises when data is published, shared or reused. Researchers need to consider legislation, policies and ethical guidelines that apply to them, as well as any undertakings made or informed consent obtained from funders or research participants.
If data is only being stored in its original form by the researcher who created it, and is not being shared or published, ethics and privacy requirements are usually met through access control and data security, rather than through data de-identification. Identifiers are usually needed for analysis of research data by the original researcher.
When de-identifying data it is important to keep in mind the possibility of re-identification. This is most likely with large datasets, which can be subject to data mining or other analytical techniques. For a lay guide to some of these issues, see "Anonymized" data really isn't-and here's why not.
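One simple way to gauge this risk before release is to count how many records share each combination of indirectly identifying values. The sketch below does this with the Python standard library. The field names and sample records are hypothetical, and the group-size measure it uses (often called k-anonymity) is introduced here only as an illustration; it is not drawn from the guidance cited in this document and does not by itself show that a dataset is safe to share.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def records_at_risk(records, quasi_identifiers, k=2):
    """Return records whose quasi-identifier combination occurs fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] < k
    ]


# Hypothetical records with names already removed.
records = [
    {"postcode": "2600", "birth_year": 1986, "gender": "F"},
    {"postcode": "2600", "birth_year": 1986, "gender": "F"},
    {"postcode": "2600", "birth_year": 1990, "gender": "M"},
    {"postcode": "2913", "birth_year": 1986, "gender": "F"},
]

fields = ["postcode", "birth_year", "gender"]
smallest = min(group_sizes(records, fields).values())
at_risk = records_at_risk(records, fields)

print("Smallest group size:", smallest)
print("Records unique on these fields:", len(at_risk))
```

Records that are unique on these fields could potentially be matched against another dataset containing the same fields alongside names. Further aggregation, such as replacing postcode with a broader region, would be one way to enlarge the groups.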
De-identification is also affected by legal requirements. In Australia, in addition to the Commonwealth legislation (the Privacy Act 1988 (Cwlth)), each state and territory has its own privacy legislation. The Office of the Australian Information Commissioner offers links to all this legislation, and to other material.
- National Health and Medical Research Council, Australian Research Council, Australian Vice-Chancellors' Committee, 2007, updated 2015. National Statement on Ethical Conduct in Human Research. General guidance on de-identification is provided by Section 3.2, 'Databanks'. This document may be reviewed as a result of the Australian Law Reform Commission's review of the privacy legislation.
- Australian Bureau of Statistics, National Statistical Service Handbook Chapter 11 contains a summary of techniques for ensuring privacy.
Examples of guidelines, discussion of issues around de-identification and two case studies (this is not a comprehensive list):
- Privacy Professor, 6 Good Reasons to De-Identify Data. http://privacyguidance.com/blog/?p=3153.
- Ann Cavoukian and Khaled El Emam (2011), Dispelling the Myths Surrounding De-identification: Anonymization Remains a Strong Tool for Protecting Privacy. Information and Privacy Commissioner, Canada.
- El Emam, K., Arbuckle, L., Koru, G., Eze, B., Gaudette, L., Neri, E., Rose, S., et al. (2012). De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset. Journal of Medical Internet Research, 14(1), e33. doi:10.2196/jmir.2001.
- Freymann, J. B., Kirby, J. S., Perry, J. H., Clunie, D. A., & Jaffe, C. C. (2012). Image data sharing for biomedical research--meeting HIPAA requirements for De-identification. Journal of Digital Imaging, 25(1), 14–24. doi:10.1007/s10278-011-9422-x.