DeBi: Discovering Differentially Expressed Biclusters using a Frequent Itemset Approach

Background

The analysis of massive high-throughput data via clustering algorithms is very important for elucidating gene functions in biological systems. However, traditional clustering methods have several drawbacks. Biclustering overcomes these limitations by grouping genes and samples simultaneously: it discovers subsets of genes that are co-expressed in certain samples. Recent studies have shown that biclustering has great potential for detecting marker genes that are associated with certain tissues or diseases. Several biclustering algorithms have been proposed. However, it is still a challenge to find biclusters that are significant based on biological validation measures. In addition, there is a need for a biclustering algorithm that is capable of analyzing very large datasets in reasonable time.

Results

Here we present a fast biclustering algorithm called DeBi (Differentially Expressed BIclusters). The algorithm is based on a well-known data mining approach called frequent itemset mining. It discovers maximum-size homogeneous biclusters in which each gene is strongly associated with a subset of samples. We evaluate the performance of DeBi on a yeast dataset, on synthetic datasets and on human datasets.

Conclusions

We demonstrate that the DeBi algorithm provides functionally more coherent gene sets than standard clustering or biclustering algorithms, using biological validation measures such as Gene Ontology term and Transcription Factor Binding Site enrichment. We show that DeBi is a computationally efficient and powerful tool for analyzing large datasets. The method is also applicable to multiple gene expression datasets coming from different labs or platforms.
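The frequent itemset idea named in the abstract can be illustrated with a minimal sketch. This is not the DeBi implementation: it assumes the expression matrix has already been binarized (each gene mapped to the set of samples in which it counts as differentially expressed), and it uses brute-force enumeration as a stand-in for a real maximal frequent itemset miner. The function name `maximal_sample_sets`, the parameter `min_genes`, and the toy data are all hypothetical.

```python
from itertools import combinations


def maximal_sample_sets(binary, min_genes):
    """Return (samples, genes) pairs for maximal frequent sample sets.

    binary    : dict mapping gene -> set of samples where the gene is "on"
                (assumed to come from a prior binarization step)
    min_genes : minimum number of genes that must share a sample set
    """
    samples = sorted(set().union(*binary.values()))
    found = []
    # Enumerate from the largest sample sets down, so that any set reached
    # later that is contained in an earlier hit is known to be non-maximal.
    for size in range(len(samples), 0, -1):
        for combo in combinations(samples, size):
            items = set(combo)
            if any(items <= prev for prev, _ in found):
                continue  # subset of an already reported set: not maximal
            genes = {g for g, on in binary.items() if items <= on}
            if len(genes) >= min_genes:
                found.append((items, genes))
    return found


# Toy, made-up binarized data: four genes, four samples.
binary = {
    "g1": {"s1", "s2", "s3"},
    "g2": {"s1", "s2", "s3"},
    "g3": {"s1", "s2"},
    "g4": {"s4"},
}
biclusters = maximal_sample_sets(binary, min_genes=2)
print(biclusters)
```

Each returned pair is a candidate bicluster: a set of samples together with every gene that is "on" in all of them. The exhaustive enumeration is exponential in the number of samples and is only meant to make the concept concrete on a toy input; a dedicated maximal frequent itemset algorithm would be needed for real datasets.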
