Validation of Gene Expression Profiles in Genomic Data through Complementary Use of Cluster Analysis and PCA-Related Biplots

High-throughput genomic assays are used in molecular biology to explore patterns of joint expression of thousands of genes. These methodologies have developed considerably over the last decade, creating a need for appropriate methods to analyze the massive data they generate. Identifying sets of genes and samples with similar expression values, and validating those results, are two critical issues in these investigations because of their clinical implications. From a statistical perspective, unsupervised class discovery methods such as Cluster Analysis are generally adopted. In practice, however, Cluster Analysis is mostly applied through hierarchical techniques, with little consideration of other methods. This is partly due to software availability and to the ease of representing results through a heatmap, which displays the clustering of genes and samples simultaneously on the same graphical device. One drawback of this strategy is that cluster stability is often neglected, which leads to over-interpretation of results. Moreover, validation of results on external datasets is still a subject of discussion, since it is well known that batch effects may influence gene expression results even after normalization. In this paper we compare several clustering algorithms (hierarchical, k-means, model-based, Affinity Propagation) and stability indices to discover common patterns of expression and to assess clustering reliability, and we propose a rank-based passive projection of Principal Components for validation purposes. Results are presented from a study involving 23 tumor cell lines and 76 genes related to a specific biological pathway, derived from a publicly available dataset.
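
To make the idea of a rank-based passive projection more concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes that expression values are replaced by within-sample ranks, that principal components are fitted on a reference dataset only, and that an external dataset is then projected onto those components without influencing them. The function names (`rank_rows`, `fit_pca`, `project_passively`), the simulated data, and the monotone distortion standing in for a batch effect are all illustrative assumptions.

```python
import numpy as np

def rank_rows(x):
    """Replace each row's values by their ranks (1 = smallest); ties broken by position."""
    order = np.argsort(x, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(x.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, x.shape[1] + 1)
    return ranks.astype(float)

def fit_pca(ranked, n_components=2):
    """Fit principal components on the reference data via SVD of the centered matrix."""
    center = ranked.mean(axis=0)
    _, _, vt = np.linalg.svd(ranked - center, full_matrices=False)
    loadings = vt[:n_components].T          # genes x components
    scores = (ranked - center) @ loadings   # samples x components
    return center, loadings, scores

def project_passively(new_ranked, center, loadings):
    """Project new samples onto existing components; the components are not re-estimated."""
    return (new_ranked - center) @ loadings

# Simulated reference data: 23 samples (rows) x 76 genes (columns), hypothetical values.
rng = np.random.default_rng(0)
reference = rng.normal(size=(23, 76))
center, loadings, ref_scores = fit_pca(rank_rows(reference))

# Hypothetical external dataset: the same samples under a monotone distortion,
# a crude stand-in for a batch effect that changes scale but not ordering.
external = np.exp(0.5 * reference + 2.0)
ext_scores = project_passively(rank_rows(external), center, loadings)
```

Because ranks are unchanged by any strictly increasing transformation applied within a sample, the projected external samples in this toy setting land on the same coordinates as their reference counterparts. Real batch effects are rarely purely monotone, so this illustrates the motivation for working with ranks rather than a guarantee.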
