Enabling Genomic-Phenomic Association Discovery without Sacrificing Anonymity

Health information technologies facilitate the collection of massive quantities of patient-level data. A growing body of research demonstrates that such information can support novel, large-scale biomedical investigations at a fraction of the cost of traditional prospective studies. Although healthcare organizations are being encouraged to share these data in de-identified form, many hesitate out of concern that sharing will allow the corresponding patients to be re-identified. Currently proposed technologies for anonymizing clinical data may make unrealistic assumptions about a recipient's ability to ascertain a patient's identity. We show that more pragmatic assumptions enable the design of anonymization algorithms that permit the dissemination of detailed clinical profiles with provable guarantees of protection. We demonstrate this strategy on a dataset of over one million medical records and show that 192 genotype-phenotype associations can be discovered with fidelity equivalent to that of non-anonymized clinical data.
