Quantification of private information leakage from phenotype-genotype data: linking attacks

Studies on genomic privacy have traditionally focused on identifying individuals using DNA variants. In contrast, molecular phenotype data, such as gene expression levels, are generally assumed to be free of such identifying information. Although there is no explicit genotypic information in phenotype data, adversaries can statistically link phenotypes to genotypes using publicly available genotype-phenotype correlations such as expression quantitative trait loci (eQTLs). This linking can be accurate when high-dimensional data (i.e., many expression levels) are used, and the resulting links can then reveal sensitive information (for example, the fact that an individual has cancer). Here we develop frameworks for quantifying the leakage of characterizing information from phenotype data sets. These frameworks can be used to estimate the leakage from large data sets before release. We also present a general three-step procedure for practically instantiating linking attacks and a specific attack using outlier gene expression levels that is simple yet accurate. Finally, we describe the effectiveness of this outlier attack under different scenarios.
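The abstract does not spell out the three steps or the outlier rule, so the following is only a minimal sketch of one plausible reading, run on synthetic data. It assumes that (1) a set of eQTLs is already selected and described only by the sign of each gene-variant correlation, (2) an unusually high or low expression level is taken as evidence for a homozygous genotype, and (3) the predicted genotypes are matched to the closest entry in a genotype data set. All function names, the 20% outlier cutoff, and the simulation parameters are illustrative assumptions, not the authors' implementation.

```python
import random


def outlier_thresholds(expression, frac=0.2):
    """Per-gene lower/upper cutoffs from an expression matrix (rows = individuals)."""
    n = len(expression)
    k = int(frac * n)
    lo, hi = [], []
    for gene_values in zip(*expression):
        ranked = sorted(gene_values)
        lo.append(ranked[k])
        hi.append(ranked[n - 1 - k])
    return lo, hi


def predict_genotypes(expr, signs, lo, hi):
    """Guess 0 or 2 alternate alleles where expression is an outlier, else None."""
    preds = []
    for value, sign, low, high in zip(expr, signs, lo, hi):
        if value >= high:
            preds.append(2 if sign > 0 else 0)
        elif value <= low:
            preds.append(0 if sign > 0 else 2)
        else:
            preds.append(None)  # no prediction for non-outlier expression
    return preds


def link(preds, genotype_db):
    """Return the ID whose genotypes are closest to the predicted ones."""
    def distance(genotypes):
        pairs = [(p, g) for p, g in zip(preds, genotypes) if p is not None]
        if not pairs:
            return float("inf")
        return sum(abs(p - g) for p, g in pairs) / len(pairs)
    return min(genotype_db, key=lambda ident: distance(genotype_db[ident]))


def simulate_accuracy(n_individuals=50, n_eqtls=200, noise=0.5, seed=0):
    """Fraction of synthetic individuals linked to their own genotype record."""
    rng = random.Random(seed)
    signs = [rng.choice([-1, 1]) for _ in range(n_eqtls)]
    genotypes = {
        i: [rng.choice([0, 1, 2]) for _ in range(n_eqtls)]
        for i in range(n_individuals)
    }
    # Synthetic expression: linear in genotype with Gaussian noise (an assumption).
    expression = [
        [s * (g - 1) + rng.gauss(0, noise) for s, g in zip(signs, genotypes[i])]
        for i in range(n_individuals)
    ]
    lo, hi = outlier_thresholds(expression)
    hits = sum(
        link(predict_genotypes(expression[i], signs, lo, hi), genotypes) == i
        for i in range(n_individuals)
    )
    return hits / n_individuals
```

Calling `simulate_accuracy()` reports how often the synthetic expression profiles are linked back to the correct genotype record. Raising `noise` or lowering `n_eqtls` is a simple way to explore how linking accuracy depends on effect strength and on the dimensionality of the released data.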
