cHawk: An Efficient Biclustering Algorithm Based on Bipartite Graph Crossing Minimization

Biclustering is a useful data mining technique for gene expression analysis and profiling. It identifies patterns in which groups of genes are correlated across a subset of conditions. Bipartite spectral partitioning is a powerful way to achieve biclustering, but its computational complexity is prohibitive for applications with large input data. We establish a connection between spectral partitioning and crossing minimization that is amenable to efficient implementation. We give a theoretical construction of a biclustering model based on crossing minimization and, building on this model, develop an efficient biclustering algorithm termed cHawk. We evaluate cHawk on both synthetic and real data sets and show that it identifies constant, coherent, and overlapping biclusters amid noise with good accuracy. Moreover, its execution time grows linearly with the input data size.
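To make the idea of crossing minimization on a bipartite graph concrete, the following is a minimal sketch and not the cHawk algorithm itself. It treats a binary matrix as a bipartite graph (rows on one side, columns on the other, an edge for every nonzero entry) and applies the classic barycenter heuristic, alternately reordering each side by the mean position of its neighbours. The function names (`count_crossings`, `barycenter_order`, `reorder`), the toy matrix, and the sweep limit are all assumptions made for illustration.

```python
def count_crossings(matrix, row_order, col_order):
    """Count edge crossings in the two-layer drawing given by the orders."""
    row_pos = {r: i for i, r in enumerate(row_order)}
    col_pos = {c: j for j, c in enumerate(col_order)}
    edges = [(row_pos[r], col_pos[c])
             for r, row in enumerate(matrix)
             for c, value in enumerate(row) if value]
    crossings = 0
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            (a, b), (c, d) = edges[i], edges[j]
            # Two edges cross when their endpoints are in opposite order.
            if (a - c) * (b - d) < 0:
                crossings += 1
    return crossings


def barycenter_order(adjacency, other_pos):
    """Sort vertices by the mean position of their neighbours."""
    def key(v):
        nbrs = adjacency[v]
        if not nbrs:
            return float("inf")  # isolated vertices go last
        return sum(other_pos[u] for u in nbrs) / len(nbrs)
    return sorted(range(len(adjacency)), key=key)


def reorder(matrix, sweeps=10):
    """Alternate barycenter passes over rows and columns until stable."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_adj = [[c for c in range(n_cols) if matrix[r][c]] for r in range(n_rows)]
    col_adj = [[r for r in range(n_rows) if matrix[r][c]] for c in range(n_cols)]
    row_order, col_order = list(range(n_rows)), list(range(n_cols))
    for _ in range(sweeps):
        col_pos = {c: i for i, c in enumerate(col_order)}
        new_rows = barycenter_order(row_adj, col_pos)
        row_pos = {r: i for i, r in enumerate(new_rows)}
        new_cols = barycenter_order(col_adj, row_pos)
        if new_rows == row_order and new_cols == col_order:
            break
        row_order, col_order = new_rows, new_cols
    return row_order, col_order


# Toy input (hypothetical): two biclusters with interleaved rows and columns.
matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
before = count_crossings(matrix, [0, 1, 2, 3], [0, 1, 2, 3])
rows, cols = reorder(matrix)
after = count_crossings(matrix, rows, cols)
reordered = [[matrix[r][c] for c in cols] for r in rows]
```

On this toy input the reordering lowers the crossing count and places the nonzero entries in contiguous blocks of `reordered`, which is the intuition behind reading biclusters off a crossing-minimized arrangement. The pairwise crossing count here is quadratic in the number of edges and is used only to check the result; it says nothing about the running time reported for cHawk.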
