High Performance Parallel/Distributed Biclustering Using Barycenter Heuristic

Biclustering refers to the simultaneous clustering of objects and their features. The use of biclustering is gaining momentum in areas such as text mining, gene expression analysis, and collaborative filtering. Large-scale data processing applications, such as collaborative filtering in e-commerce systems and genome-wide gene expression analysis in microarray experiments, demand high performance, so a high-performance parallel/distributed solution to the biclustering problem is highly desirable. Recently, Ahmad et al. [1] showed that Bipartite Spectral Partitioning, a popular technique for biclustering, can be reformulated as a graph drawing problem in which the objective is to minimize Hall’s energy of the bipartite graph representation of the input data. They showed that the optimal solution to this problem is achieved when nodes are placed at the barycenter of their neighbors. In this paper, we provide a parallel algorithm for biclustering based on this formulation. We show that energy minimization using the barycenter heuristic is embarrassingly parallel. The challenge is to design a bicluster identification algorithm that is both scalable and accurate. We show that our parallel implementation is not only extremely scalable but also comparable in accuracy to the serial implementation. We evaluated the proposed parallel biclustering algorithm on large synthetic data sets using up to 256 processors. The experimental evaluation shows large superlinear speedups, scalability, and a high level of accuracy.
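To make the barycenter idea concrete, the following is a minimal serial sketch and not the algorithm evaluated in the paper. It assumes a binary data matrix, one-dimensional node positions, and a rank-based variant of the heuristic in which nodes are re-sorted by their barycenters after each half-sweep; the function names (`barycenters`, `ranks`, `barycenter_layout`, `hall_energy`) are hypothetical. Each node's barycenter depends only on the current positions of its neighbors, so the loop inside `barycenters` has no dependencies between iterations, which is the property that makes this step easy to parallelize.

```python
def barycenters(adjacency, positions):
    """Mean position of each node's neighbours (0.0 for isolated nodes).

    Every entry is computed independently of the others, so this loop
    could be split across processors without communication.
    """
    result = []
    for neighbours in adjacency:
        if neighbours:
            result.append(sum(positions[j] for j in neighbours) / len(neighbours))
        else:
            result.append(0.0)
    return result


def ranks(values):
    """Replace values by their rank order (ties broken by node index)."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    rank = [0] * len(values)
    for r, i in enumerate(order):
        rank[i] = r
    return rank


def hall_energy(matrix, row_pos, col_pos):
    """Sum of squared position differences over all edges (unit weights)."""
    return sum(
        (row_pos[i] - col_pos[j]) ** 2
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value
    )


def barycenter_layout(matrix, sweeps=10):
    """Alternately move rows, then columns, to the barycenter of their neighbours."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_adj = [[j for j in range(n_cols) if matrix[i][j]] for i in range(n_rows)]
    col_adj = [[i for i in range(n_rows) if matrix[i][j]] for j in range(n_cols)]
    row_pos = list(range(n_rows))
    col_pos = list(range(n_cols))
    for _ in range(sweeps):
        row_pos = ranks(barycenters(row_adj, col_pos))
        col_pos = ranks(barycenters(col_adj, row_pos))
    return row_pos, col_pos


# Two interleaved biclusters: rows {0, 2} x cols {0, 2} and rows {1, 3} x cols {1, 3}.
matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
energy_before = hall_energy(matrix, [0, 1, 2, 3], [0, 1, 2, 3])
row_pos, col_pos = barycenter_layout(matrix)
energy_after = hall_energy(matrix, row_pos, col_pos)
print(row_pos, col_pos, energy_before, energy_after)
```

On this toy input the rows and columns of each bicluster end up in adjacent positions and the energy drops, so the biclusters appear as contiguous blocks once the matrix is reordered. Identifying the block boundaries is a separate step that this sketch does not attempt.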

[1] Jianhong Zhou, et al. ParRescue: Scalable Parallel Algorithm and Implementation for Biclustering over Large Distributed Datasets, 2006, 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06).

[2] Lusheng Wang, et al. Computing the maximum similarity bi-clusters of gene expression data, 2007, Bioinform.

[3] Ling Qin, et al. A Parallel Biclustering Algorithm for Gene Expressing Data, 2008, 2008 Fourth International Conference on Natural Computation.

[4] C. Ding, et al. Spectral relaxation models and structure analysis for K-way graph clustering and bi-clustering, 2001.

[5] Bryan A. Pendleton, et al. Power of the Few vs. Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie, 2006.

[6] Inderjit S. Dhillon, et al. Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data, 2004, SDM.

[7] Inderjit S. Dhillon, et al. Clustering with Bregman Divergences, 2005, J. Mach. Learn. Res.

[8] S. D. Pietra, et al. Statistical Learning Algorithms Based on Bregman Distances, 1997.

[9] Arlindo L. Oliveira, et al. Biclustering algorithms for biological data analysis: a survey, 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[10] George M. Church, et al. Biclustering of Expression Data, 2000, ISMB.

[11] Ronald L. Rivest, et al. Introduction to Algorithms, 1990.

[12] Inderjit S. Dhillon, et al. Co-clustering documents and words using bipartite spectral graph partitioning, 2001, KDD '01.

[13] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.

[14] Inderjit S. Dhillon, et al. Concept Decompositions for Large Sparse Text Data Using Clustering, 2004, Machine Learning.

[15] Waseem Ahmad, et al. Phoenix: Privacy Preserving Biclustering on Horizontally Partitioned Data, 2007, PinKDD.

[16] Sean R Eddy, et al. What is dynamic programming?, 2004, Nature Biotechnology.

[17] Lothar Thiele, et al. A systematic comparison and evaluation of biclustering methods for gene expression data, 2006, Bioinform.

[18] Kei-Hoi Cheung, et al. YMD: a microarray database for large-scale gene expression analysis, 2002, AMIA.

[19] Srujana Merugu, et al. A scalable collaborative filtering framework based on co-clustering, 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[20] Waseem Ahmad, et al. An Architecture for Privacy Preserving Collaborative Filtering on Web Portals, 2007.