Mechanisms for Hiding Sensitive Genotypes with Information-Theoretic Privacy

The growing availability of personal genomics services comes with increasing concerns about genomic privacy. Individuals may wish to withhold sensitive genotypes that contain critical health-related information when sharing their data with such services. A straightforward solution that masks only the sensitive genotypes does not ensure privacy, because correlations within the genome allow the masked genotypes to be inferred from the ones that remain visible. Here, we develop an information-theoretic mechanism for masking sensitive genotypes, which ensures that no information about the sensitive genotypes is leaked. We also propose an efficient algorithmic implementation of our mechanism for genomic data governed by hidden Markov models. Our work is a step towards more rigorous control of privacy in genomic data sharing.
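To make the leakage from naive masking concrete, the following is a minimal sketch on a toy model, not the mechanism proposed in this work. It assumes three binary sites linked as a simple Markov chain, with a hypothetical uniform prior (`PRIOR`) and a hypothetical probability `STAY` that adjacent sites carry the same allele. The middle site is treated as sensitive and is the only one masked.

```python
from itertools import product

# Hypothetical toy model (all numbers are illustrative assumptions):
# three binary sites x1 -> x2 -> x3 forming a Markov chain.
PRIOR = [0.5, 0.5]   # P(x1 = 0), P(x1 = 1)
STAY = 0.9           # P(next site equals current site), i.e. strong linkage


def trans(a, b):
    """Transition probability P(next = b | current = a)."""
    return STAY if a == b else 1.0 - STAY


def joint(x1, x2, x3):
    """Joint probability of the three sites under the toy chain."""
    return PRIOR[x1] * trans(x1, x2) * trans(x2, x3)


def posterior_sensitive(x1, x3):
    """P(x2 = 1 | x1, x3): what an observer can infer when only x2 is masked."""
    num = joint(x1, 1, x3)
    den = joint(x1, 0, x3) + joint(x1, 1, x3)
    return num / den


# Marginal belief about the sensitive site before seeing anything.
prior_x2 = sum(joint(a, 1, c) for a, c in product((0, 1), repeat=2))

# Belief after seeing both unmasked neighbours equal to 1.
leak = posterior_sensitive(1, 1)

print(f"P(x2 = 1)                  = {prior_x2:.3f}")
print(f"P(x2 = 1 | x1 = 1, x3 = 1) = {leak:.3f}")
```

In this toy setting the observer's belief about the masked site moves from 0.5 to roughly 0.99 once the neighbouring sites are seen, so masking the sensitive site alone hides very little. A mechanism with the guarantee described above would instead have to make the released data statistically independent of the sensitive genotype, which in general means masking additional, non-sensitive sites as well.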
