Minimum entropy clustering and applications to gene expression analysis

Clustering is a common methodology for analyzing gene expression data. We present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy criterion, measured on a posteriori probabilities: the conditional entropy of the clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural α-entropy. Interestingly, the minimum entropy criterion based on structural α-entropy is equal to the probability of error of the nearest neighbor method when α = 2. This is further evidence that the proposed criterion is good for clustering. With a nonparametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of the adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers, while our method can correctly reveal the structure of the data and identify outliers at the same time.
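
As a rough illustration of the criterion described above, the following sketch evaluates the average entropy of a posteriori cluster probabilities. It is not the paper's algorithm: the function names are hypothetical, the posteriors are supplied directly rather than estimated nonparametrically, and the normalization of the structural α-entropy, (2^(1-α) - 1)^(-1) (Σ p_i^α - 1), is assumed here. Under that assumption the α = 2 case reduces to a multiple of 1 - Σ p_i², the quantity that appears in the asymptotic nearest neighbor error.

```python
import numpy as np

def structural_alpha_entropy(p, alpha):
    """Havrda-Charvat structural alpha-entropy of a probability vector p.

    Assumed normalization: (sum_i p_i**alpha - 1) / (2**(1 - alpha) - 1).
    For alpha == 1 the Shannon entropy in bits (the limiting case) is returned.
    """
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        nz = p[p > 0]  # treat 0 * log(0) as 0
        return float(-np.sum(nz * np.log2(nz)))
    norm = 2.0 ** (1.0 - alpha) - 1.0
    return float((np.sum(p ** alpha) - 1.0) / norm)

def entropy_criterion(posteriors, alpha=1.0):
    """Average entropy of cluster posteriors over all observations.

    posteriors[j, i] is a (hypothetical) estimate of p(cluster i | x_j);
    each row is expected to sum to 1. Lower values mean crisper assignments.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    return float(np.mean([structural_alpha_entropy(row, alpha) for row in posteriors]))

if __name__ == "__main__":
    crisp = [[1.0, 0.0], [0.0, 1.0]]   # every point clearly in one cluster
    fuzzy = [[0.5, 0.5], [0.5, 0.5]]   # every point maximally ambiguous
    for alpha in (1.0, 2.0):
        print(alpha, entropy_criterion(crisp, alpha), entropy_criterion(fuzzy, alpha))
```

A partition that assigns every observation confidently to one cluster drives the criterion toward zero, while ambiguous assignments raise it, which is the behavior a minimization procedure would exploit.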
