On the privacy risks of sharing clinical proteomics data

Although the privacy issues in human genomic studies are well known, the privacy risks of clinical proteomic data have not been thoroughly studied. As a proof of concept, we report a comprehensive analysis of these risks. The analysis shows that a small number of peptides carrying the minor alleles (referred to as minor allelic peptides) at non-synonymous single nucleotide polymorphism (nsSNP) sites can be identified in typical clinical proteomic datasets acquired from the blood/serum samples of individual patients, and that a patient can be identified from these peptides with high confidence. Our results suggest significant privacy risks in raw clinical proteomic data. However, these risks can be mitigated by a straightforward pre-processing step that removes the very small fraction of MS/MS spectra identified as minor allelic peptides (about 0.1%; 7.14 out of 7,504 spectra on average), which has little or no impact on the subsequent analysis (and re-use) of these datasets.
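
The pre-processing step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Spectrum` record, the `remove_minor_allelic_spectra` function, and the peptide strings are hypothetical names and values chosen for illustration. The sketch assumes each MS/MS spectrum already carries the peptide assigned by a database search, and that a set of minor allelic peptide sequences is available from some nsSNP resource.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Spectrum:
    """Hypothetical record for one MS/MS spectrum and its search result."""
    scan_id: int
    peptide: Optional[str]  # assigned peptide sequence, or None if unidentified


def remove_minor_allelic_spectra(
    spectra: List[Spectrum], minor_allelic_peptides: Set[str]
) -> Tuple[List[Spectrum], float]:
    """Drop spectra identified as minor allelic peptides.

    Returns the retained spectra and the fraction of spectra removed.
    Unidentified spectra (peptide is None) are kept.
    """
    kept = [s for s in spectra if s.peptide not in minor_allelic_peptides]
    removed = len(spectra) - len(kept)
    fraction_removed = removed / len(spectra) if spectra else 0.0
    return kept, fraction_removed


# Toy data: the sequences below are made up for illustration only.
spectra = [
    Spectrum(1, "LVNEVTEFAK"),
    Spectrum(2, "AEFVEVTK"),   # treated here as a minor allelic peptide
    Spectrum(3, None),         # unidentified spectrum
    Spectrum(4, "YLYEIAR"),
]
minor_allelic = {"AEFVEVTK"}

kept, fraction_removed = remove_minor_allelic_spectra(spectra, minor_allelic)
print([s.scan_id for s in kept], fraction_removed)
```

For scale, the average reported in the abstract, 7.14 of 7,504 spectra, corresponds to roughly 0.095% of a dataset, which rounds to the quoted 0.1%.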
