A Review on clustering and visualization methodologies for Genomic data analysis

This work surveys the aims, problems and methods of Cluster Analysis and its applications in genomic data analysis. By Cluster Analysis we mean a data exploration tool whose goal is to group similar objects into categories without a priori information on their classes. Cluster analysis can be viewed as a classification problem with no labeled samples, that is, with no a priori knowledge about how the objects should be grouped. Cluster analysis involves several heterogeneous problems, which are often treated separately. In this work we examine these problems and illustrate the different approaches and their applications to Computational Biology and Bioinformatics. The problems related to Cluster Analysis in the context of high-dimensional genomic data analysis are summarized in Figure 1. In this figure each node represents either an item of the data exploration problem addressed via cluster analysis or a computational method used in this kind of data analysis. The edges of the graph can be uni-directional or bi-directional. Following a path according to the edge directions, one can see a sequence of steps toward the final goal of cluster analysis, as well as the relationships between the different problems and computational methods involved in unsupervised genomic data analysis.
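To make the notion of grouping objects without class labels concrete, the following is a minimal sketch of one common clustering procedure, k-means, applied to toy data. It is an illustration only: the survey does not single out k-means, and the function name `kmeans`, the toy `profiles` values, the choice of the first k points as initial centroids, and the fixed iteration count are all assumptions made for this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters using only distances, never class labels."""
    # Assumption for illustration: the first k points serve as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy "expression profiles": two low and two high, interleaved.
profiles = [
    (0.1, 0.2, 0.1),
    (5.0, 5.1, 4.9),
    (0.2, 0.1, 0.3),
    (4.8, 5.2, 5.0),
]
labels, centroids = kmeans(profiles, k=2)
```

Here the two low profiles end up in one group and the two high profiles in the other, although no labels were supplied. In practice the number of clusters k is itself unknown, and choosing it is one of the problems the survey addresses.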
