Hierarchical Clustering of High-Throughput Expression Data Based on General Dependences

High-throughput expression technologies, such as gene expression arrays and liquid chromatography-mass spectrometry (LC-MS), measure thousands of features (genes or metabolites) on a continuous scale. In such data, both linear and nonlinear relations exist between features. Nonlinear relations can reflect critical regulation patterns in the biological system, but traditional clustering methods based on linear associations neither identify nor use them. Clustering based on general dependences, i.e., both linear and nonlinear relations, is hampered by the high dimensionality and high noise level of the data. We developed a sensitive nonparametric measure of general dependence between random variables, or groups of random variables, in high dimensions. Based on this dependence measure, we developed a hierarchical clustering method. In simulation studies, the method outperformed correlation-based and mutual information (MI)-based hierarchical clustering methods in clustering features with nonlinear dependences. We applied the method to a microarray data set measuring gene expression in a cell-cycle time series and show that it generates biologically relevant results. The R code is available at http://userwww.service.emory.edu/~tyu8/GDHC.
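The gap between linear association and general dependence can be illustrated with a small simulation. The sketch below is not the dependence measure developed in this work; it assumes a simple plug-in mutual information estimate on rank-based bins as a stand-in for a general dependence score, and all function names, the bin count, and the simulated data are hypothetical choices made for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient (linear association)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def binned_mutual_information(x, y, bins=8):
    """Plug-in MI estimate (in nats) using equal-frequency (rank) bins.

    Illustrative stand-in for a general dependence score; it is biased
    upward for small samples and is not the paper's measure.
    """
    n = len(x)

    def rank_bins(v):
        # Assign each observation to one of `bins` equal-frequency bins.
        order = sorted(range(n), key=lambda i: v[i])
        out = [0] * n
        for rank, i in enumerate(order):
            out[i] = rank * bins // n
        return out

    bx, by = rank_bins(x), rank_bins(y)
    joint = {}
    for a, b in zip(bx, by):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    px = [bx.count(k) / n for k in range(bins)]
    py = [by.count(k) / n for k in range(bins)]
    return sum(
        (c / n) * math.log((c / n) / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


# Simulated features: y depends on x only through a quadratic relation.
rng = random.Random(0)
n = 500
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
y_nonlinear = [a * a + rng.gauss(0.0, 0.05) for a in x]
y_independent = [rng.uniform(-1.0, 1.0) for _ in range(n)]

r_nonlinear = pearson(x, y_nonlinear)
mi_nonlinear = binned_mutual_information(x, y_nonlinear)
mi_independent = binned_mutual_information(x, y_independent)

print(f"Pearson r (quadratic pair):   {r_nonlinear:.3f}")
print(f"Binned MI (quadratic pair):   {mi_nonlinear:.3f}")
print(f"Binned MI (independent pair): {mi_independent:.3f}")
```

In this setup the quadratic pair has a correlation close to zero, so a correlation-based distance would treat the two features as unrelated, while the binned MI separates it clearly from the independent pair. A clustering method built on a general dependence measure could group such features together.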
