Privacy Preserving Principal Component Analysis Clustering for Distributed Heterogeneous Gene Expression Datasets

In this paper, we present approaches for performing principal component analysis (PCA) clustering on distributed heterogeneous genomic datasets with privacy protection. The approaches allow data providers to collaborate in identifying gene profiles from a global viewpoint while protecting the sensitive genomic data from possible privacy leaks. We further develop a framework for privacy-preserving PCA-based gene clustering that involves two types of participants: data providers and a trusted central site (TCS). Two methodologies are employed: Collective PCA (C-PCA) and Repeating PCA (R-PCA). C-PCA requires local sites to transmit a sample of their original data to the TCS and can be applied to any heterogeneous datasets. R-PCA requires that all local sites have the same or a similar number of columns, but releases no original data. Experiments on five independent genomic datasets show that both C-PCA and R-PCA maintain very good accuracy compared with the centralized scenario.

DOI: 10.4018/978-1-4666-2653-9.ch014
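As an illustration only, the data flow described for C-PCA (local sites release a sample of original data, and the TCS derives a global view) might be sketched as follows. The abstract does not specify the protocol, so everything here is an assumption: the sites are taken to hold different columns for the same genes, the sampled rows are agreed in advance, and the function names (`local_sample`, `tcs_fit_basis`, `local_project`, `tcs_combine`) are hypothetical. The sketch is not the authors' algorithm and makes no privacy guarantee.

```python
import numpy as np

def local_sample(X, row_idx):
    """Site side: release only the agreed sample of gene rows (original values)."""
    return X[row_idx]

def tcs_fit_basis(samples, k):
    """TCS side: PCA on the column-wise concatenation of the sampled rows.

    Returns the top-k principal directions and the column means,
    each split back into one block per site.
    """
    S = np.hstack(samples)                      # (n_sampled_genes, total_columns)
    mean = S.mean(axis=0)
    _, _, Vt = np.linalg.svd(S - mean, full_matrices=False)
    V = Vt[:k].T                                # (total_columns, k)
    cuts = np.cumsum([s.shape[1] for s in samples])[:-1]
    return np.split(V, cuts, axis=0), np.split(mean, cuts)

def local_project(X, V_block, mean_block):
    """Site side: project all local genes onto the site's block of the basis."""
    return (X - mean_block) @ V_block           # (n_genes, k) partial scores

def tcs_combine(partials):
    """TCS side: partial scores add up to the scores of the combined data."""
    return np.sum(partials, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two hypothetical sites: same 50 genes, 4 and 6 experimental conditions.
    sites = [rng.normal(size=(50, 4)), rng.normal(size=(50, 6))]
    rows = rng.choice(50, size=20, replace=False)   # sampled gene rows
    blocks, means = tcs_fit_basis([local_sample(X, rows) for X in sites], k=2)
    scores = tcs_combine(
        [local_project(X, V, m) for X, V, m in zip(sites, blocks, means)]
    )
    print(scores.shape)
```

Under these assumptions only the sampled rows and the low-dimensional partial scores leave a site, and the TCS could then cluster the genes on the combined scores. Whether the published C-PCA releases partial scores in this way is not stated in the abstract.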
