Gene microarray data analysis using parallel point-symmetry-based clustering

Identifying co-expressed genes is a central goal of microarray gene expression analysis. Point-symmetry-based clustering is an important unsupervised learning technique for recognising symmetrical clusters, whether convex or non-convex in shape. To enable fast clustering of large microarray data sets, we propose a distributed, time-efficient and scalable approach to the point-symmetry-based K-Means algorithm. A natural basis for analysing gene expression data with a symmetry-based algorithm is to group together genes with similar symmetrical expression patterns. The new parallel implementation achieves linear speedup without sacrificing the quality of the clustering solution on large microarray data sets. The parallel point-symmetry-based K-Means algorithm is compared with another new parallel symmetry-based K-Means algorithm and with existing parallel K-Means on eight artificial and benchmark microarray data sets, to demonstrate its superiority in both timing and validity. A statistical analysis is also performed to establish the significance of this message-passing-interface (MPI) based point-symmetry K-Means implementation. We also analyse the biological relevance of the clustering solutions.
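The abstract does not spell out the parallel scheme, so the following is only a minimal sketch of the data-parallel pattern that message-passing K-Means implementations commonly follow: each process assigns its own share of the genes to the nearest centre, and a global sum reduction combines the partial results before the centres are updated. The function names (`local_step`, `reduce_and_update`, `parallel_kmeans`) are hypothetical, the processes are simulated by a plain loop rather than real MPI calls, and Euclidean distance stands in for the point-symmetry distance, which would be passed in through the `distance` argument.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance; a placeholder for a point-symmetry distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_step(chunk, centres, distance=euclidean):
    """Work of one (simulated) process: assign its points to the nearest
    centre and return per-cluster partial sums and counts."""
    dim = len(centres[0])
    sums = [[0.0] * dim for _ in centres]
    counts = [0] * len(centres)
    for x in chunk:
        k = min(range(len(centres)), key=lambda j: distance(x, centres[j]))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += x[d]
    return sums, counts


def reduce_and_update(partials, centres):
    """Stand-in for a global sum reduction across processes, followed by
    the centre update. Empty clusters keep their previous centre."""
    dim = len(centres[0])
    new_centres = []
    for k, old in enumerate(centres):
        total = sum(counts[k] for _, counts in partials)
        if total == 0:
            new_centres.append(list(old))
            continue
        new_centres.append(
            [sum(sums[k][d] for sums, _ in partials) / total for d in range(dim)]
        )
    return new_centres


def parallel_kmeans(data, centres, n_workers=2, iters=10, distance=euclidean):
    """Run a fixed number of iterations over n_workers simulated processes."""
    # Round-robin partition of the data, one chunk per simulated process.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    for _ in range(iters):
        partials = [local_step(chunk, centres, distance) for chunk in chunks]
        centres = reduce_and_update(partials, centres)
    return centres


if __name__ == "__main__":
    toy = [[0, 0], [0, 2], [10, 10], [10, 12]]  # made-up toy data, not from the paper
    print(parallel_kmeans(toy, [[0, 0], [10, 10]], n_workers=2))
```

Because only per-cluster sums and counts cross process boundaries, the amount of data exchanged per iteration depends on the number of clusters and dimensions rather than on the number of genes. A distance that needs neighbour information, as point-symmetry measures typically do, would require additional design decisions that this sketch does not attempt to reproduce.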
