A highly efficient multi-core algorithm for clustering extremely large datasets

Background
In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches to parallelizing algorithms rely largely on network communication protocols that connect, and therefore require, multiple computers. One answer to this problem is to use the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.

Results
We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis, which employ repeated runs with slightly changed parameters. The computation speed of our Java-based algorithm increased by a factor of 10 for large data sets, while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization.

Conclusions
Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity analysis and cluster number estimation on the laboratory computer.
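To illustrate why k-means lends itself to single-machine parallelization, the following is a minimal sketch in Java, not the authors' implementation. It assumes plain Java parallel streams rather than the transactional-memory design the abstract describes, and every name in it (`ParallelKMeansSketch`, `assign`, `update`, `cluster`) is hypothetical. The idea it shows is that the assignment step treats each data point independently, so it can be spread across cores without locking as long as each task writes only to its own output slot.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration only: a shared-memory k-means whose
// assignment step runs on all available cores via parallel streams.
class ParallelKMeansSketch {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Assignment step: points are independent, so the index range is
    // processed in parallel. Each task writes only labels[i].
    static int[] assign(double[][] data, double[][] centroids) {
        int[] labels = new int[data.length];
        IntStream.range(0, data.length).parallel().forEach(i -> {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double dist = squaredDistance(data[i], centroids[c]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            labels[i] = best;
        });
        return labels;
    }

    // Update step (kept sequential here): each centroid becomes the mean
    // of its members; an empty cluster keeps its previous centroid.
    static double[][] update(double[][] data, int[] labels, double[][] old) {
        int k = old.length;
        int dim = old[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < data.length; i++) {
            counts[labels[i]]++;
            for (int d = 0; d < dim; d++) {
                sums[labels[i]][d] += data[i][d];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                sums[c][d] = counts[c] == 0 ? old[c][d] : sums[c][d] / counts[c];
            }
        }
        return sums;
    }

    // Alternates update and assignment until labels stop changing
    // or maxIter iterations have been performed.
    static int[] cluster(double[][] data, double[][] initial, int maxIter) {
        double[][] centroids = initial;
        int[] labels = assign(data, centroids);
        for (int iter = 0; iter < maxIter; iter++) {
            centroids = update(data, labels, centroids);
            int[] next = assign(data, centroids);
            if (Arrays.equals(next, labels)) {
                break;
            }
            labels = next;
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] init = {{0, 0}, {10, 10}};
        // prints [0, 0, 1, 1]
        System.out.println(Arrays.toString(cluster(data, init, 100)));
    }
}
```

In this sketch only the assignment step is parallel, because it dominates the cost for large data sets and needs no synchronization. Parallelizing the centroid update as well would require coordinating concurrent writes to shared sums, which is the kind of problem that transactional-memory designs are meant to address.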
