CORRESPONDENCE
The discombobulation of de-identification
NATURE BIOTECHNOLOGY, VOLUME 34, NUMBER 11, NOVEMBER 2016

To the Editor: In 2005, one of us (B.M.K.) coauthored a Correspondence in your pages entitled "The Babel of genetic data terminology," which warned of a dangerously inconsistent and confusing set of terms in the literature describing the identifiability of genetic data1. We are now writing to report that, in the intervening decade, the literature has become even more discombobulated with regard to terminology. Here we summarize the Babel-like lexicon for de-identified data and offer our own suggestions for harmonizing terms.

The benefits of next-generation sequencing2, mobile health apps3,4, cloud computing5 and big-data analytics have now arrived. They are, however, accompanied by unwelcome friends: namely, a flourishing of novel re-identification techniques that have thrown the idea of guaranteed, total anonymization into question6–8. Moreover, international research guidelines are turning away from anonymization for reasons tied to data quality, participant withdrawal and the need to communicate findings and continually link with clinical or other data9. Nascent efforts to tie data protection to proportionate and realistic risk assessment are appearing10,11. At the same time, mandatory policies imposed by funders are pushing researchers toward greatly increased data sharing, and legal duties often require 'de-identification' as a form of privacy protection. Researchers' understanding of 'anonymization' often differs in strictness from what is actually required, which almost guarantees over- or under-sharing, posing risks to participant privacy or to research potential, respectively.

Legally, anonymized data are not personal data and are thus not subject to personal data protection duties. But there is no consensus definition of anonymization. Although record re-identification codes are sometimes allowed12,13, law and policymakers tend to define anonymization as "irreversible"1,14,15.
Occasionally, even indirect identifiers (or quasi-identifiers) seem permissible, as in criteria that ask whether a person's identity "can be readily ascertained"16. Still other sources contradict themselves. The UK Information Commissioner's Office (London), for example, adopts a definition suggesting irreversibility yet conflates anonymized data with pseudonymized data17 (the latter meaning data that can be re-identified only with access to a deliberately crafted re-identification mechanism). The Global Alliance for Genomics and Health's 2015 Privacy and Security Policy defines anonymization as a process that "prevents the identity of an individual from being readily determined by a reasonably foreseeable method"18; its later Data-Sharing Lexicon refined anonymization to mean the "irreversible delinking of identifying information from associated data"19.

The same holds for other terms describing identifiability. De-identification is often defined as synonymous with irreversible anonymization1,18,19. The US Health Insurance Portability and Accountability Act (HIPAA) similarly uses it to refer to data sets to which its anonymization process has been applied, but HIPAA also provides for 'de-identified' data sets to which a re-identification code has been added20. Moreover, health researchers tend to use the term 'anonymous' to describe information that was collected without direct identifiers, rather than data whose identifiers were later removed9,15; in recent data privacy instruments such as the European Union's (EU; Brussels) General Data Protection Regulation14, by contrast, anonymous is used interchangeably with anonymized, a concept covering any data that cannot be linked back to an individual.

The need for harmonization of terminology is clear. But what identifiability classification system would best help law and policymakers to regulate de-identification, and researchers to understand it?
We believe that 'anonymized data' (or 'anonymous data') should mean data that cannot reasonably foreseeably be re-identified, alone or in combination with other data. 'Pseudonymized data' (often referred to as 'coded data') should mean data that can be re-identified only with access to a deliberately crafted re-identification mechanism. This pseudonymization mechanism can be single- or double-coding, encryption or tokenization, with appropriate safeguards in place. Data that can be re-identified using quasi-identifiers, however, are not pseudonymized. In light of occasional but recurring confusion in the literature on this point, we stress that the mere substitution of direct identifiers with a re-identification mechanism does not result in pseudonymized data unless it is also shown that the quasi-identifiers do not allow re-identification. Otherwise, the data remain identifiable and fall within some definitions of 'masked' data17. When the data include plain-text direct identifiers, they are 'identified'.

Given the emergence of increasingly sophisticated re-identification attacks8,21–25, it is now reasonable to consider genetic data anonymized or pseudonymized only in narrow circumstances, though we disagree with literature suggesting that anonymization should be abandoned altogether6,7. Even aggregate statistics can sometimes allow re-identification of a data set, although at some level of generality this ceases to be the case (for example, the percentage of US people with a particular single-nucleotide variant). The time when the mere removal of direct identifiers was considered defensible anonymization26 is past. Our dual schema accommodates new techniques, such as secure multiparty computation, homomorphic encryption, k-anonymity and differential privacy27, without having to refer explicitly to any of them, by making identifiability determinations on a case-by-case, contextual basis28.
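The distinction drawn above — that swapping out direct identifiers does not by itself yield pseudonymized data, because quasi-identifiers may still single out individuals — can be illustrated in code. The following sketch uses a toy data set and hypothetical field names of our own devising (not from any cited work): a keyed token (HMAC) stands in for the deliberately crafted re-identification mechanism, and a minimal k-anonymity check over the remaining quasi-identifiers shows why coding alone is insufficient.

```python
import hmac
import hashlib
from collections import Counter

# Hypothetical toy records: one direct identifier plus two quasi-identifiers.
records = [
    {"name": "Alice Smith", "zip": "02139", "birth_year": 1980},
    {"name": "Bob Jones",   "zip": "02139", "birth_year": 1980},
    {"name": "Carol Lee",   "zip": "02139", "birth_year": 1975},
]

# The key is the re-identification mechanism, held separately by the custodian.
SECRET_KEY = b"held-separately-by-the-data-custodian"

def pseudonymize(record, key):
    """Replace the direct identifier with a keyed token (HMAC-SHA256).
    Only a holder of `key` can regenerate the token and re-link identities."""
    token = hmac.new(key, record["name"].encode(), hashlib.sha256).hexdigest()[:12]
    out = {k: v for k, v in record.items() if k != "name"}
    out["token"] = token
    return out

def k_anonymity(rows, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifiers.
    k == 1 means at least one record is unique on those fields and
    therefore may still be re-identifiable by linkage."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(counts.values())

coded = [pseudonymize(r, SECRET_KEY) for r in records]
k = k_anonymity(coded, ["zip", "birth_year"])
print(k)  # 1: one (zip, birth_year) pair is unique, so coding alone is not enough
```

Here the coded data set still has k = 1 on its quasi-identifiers, so under the definitions proposed above it would remain identifiable rather than pseudonymized — the quasi-identifier check, not the removal of names, is what carries the weight.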
In short, although the details of, and differences between, techniques to limit identifiability will necessarily be highly significant to the technicians applying them to a given data set, our view is that from the perspective of policymakers the distinction of highest significance is almost always whether the data have been anonymized or pseudonymized. As for 'de-identification' itself, we believe that the adjective 'de-identified' is ambiguous and confusing to the degree that it should be avoided altogether, whereas the verb 'de-identify' is acceptable to describe any process that aims to limit the identifiability of personal data. Given the sea of confusion in which the terminology describing identifiability finds itself, such harmonized usage is overdue.
References

[1] Terry, S.F. et al. The Global Alliance for Genomics & Health (2014).
[2] Stodden, V. et al. Privacy, Big Data, and the Public Good: Frameworks for Engagement (2014).
[3] Nietfeld, J.J. What is anonymous? EMBO Rep. (2007).
[4] Nissenbaum, H. et al. Big Data's End Run around Anonymity and Consent. Book of Anonymity (2014).
[5] Yang, Y. et al. Deterministic identification of specific individuals from GWAS results. Bioinformatics (2015).
[6] Nelson, S. et al. Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genet. (2008).
[7] Korbel, J.O. et al. Data analysis: Create a cloud commons. Nature (2015).
[8] Lin, Z. et al. Genomic Research and Human Subject Privacy. Science (2004).
[9] Halperin, E. et al. Identifying Personal Genomes by Surname Inference. Science (2013).
[10] Cate, F.H. et al. Data Protection Principles for the 21st Century (2013).
[11] Malin, B. et al. How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems. J. Biomed. Inform. (2004).
[12] Solomon, S. Ethical Challenges to Next-Generation Sequencing (2015).
[13] Hao, K. et al. Bayesian method to predict individual SNP genotypes from gene expression data. Nat. Genet. (2012).
[14] Abadie, J. et al. Article 29 Working Party (2016).
[15] Knoppers, B.M. et al. The Babel of genetic data terminology. Nat. Biotechnol. (2005).