Gene expression data clustering analysis: A survey

The advent of DNA microarray technology has enabled biologists to monitor the mRNA expression levels of thousands of genes simultaneously. In this survey, we review approaches to analyzing gene expression data with clustering techniques and discuss the performance of existing clustering algorithms under each approach. Because the choice of proximity measure strongly affects how effective a clustering technique is, we briefly discuss various proximity measures. Finally, since evaluating clustering techniques on gene expression data requires validity measures and sources of numeric data, we discuss these as well.
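
To illustrate why the proximity measure matters, the following minimal sketch compares two commonly used measures, Euclidean distance and Pearson correlation distance, on toy expression profiles. The profiles, the function names, and the choice of these two measures are assumptions made for illustration; they are not taken from the survey itself.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape.

    Raises ZeroDivisionError if either profile is constant (zero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)


# Hypothetical expression profiles over five conditions (illustrative values).
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape as gene_a, 10x magnitude
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]       # opposite shape, similar magnitude

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(
        name,
        "euclidean:", round(euclidean_distance(gene_a, other), 3),
        "pearson:", round(pearson_distance(gene_a, other), 3),
    )
```

Under Euclidean distance, gene_a lies closer to gene_c (opposite pattern, similar magnitude) than to gene_b, whereas the correlation-based distance groups gene_a with gene_b (same pattern, different magnitude). Which behavior is preferable depends on whether absolute expression level or expression pattern is of interest.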
