Algorithm for low-variance biclusters to identify coregulation modules in sequencing datasets

High-throughput sequencing (ChIP-Seq) data record binding events together with their possible binding locations and strengths, which are then interpreted through the locations of peaks. Recent methods tend to summarize all ChIP-Seq peaks detected within a limited region upstream and downstream of each gene into a single real-valued score that quantifies the probability of regulation in that region. Applying subspace clustering (or biclustering) techniques to these scores can reveal important knowledge such as potential co-regulation or co-factor mechanisms. An ideal bicluster contains a subset of genes and a subset of transcription factors (TFs) such that its cell values are distributed around a mean value with low variance. Such a bicluster indicates a set of TFs regulating a set of genes with nearly the same probability values. However, most existing biclustering algorithms can neither enforce variance as a strict limit on the values contained in a bicluster nor use variance as the guiding metric while searching for the desired biclusters. In this paper we present an algorithm that searches spaces defined by lattices containing all overlapping biclusters and uses a bound on variance as the guiding metric. We show that the algorithm is an efficient and effective method for discovering possibly overlapping biclusters under pre-defined variance bounds. We report its results on synthetic, ChIP-Seq, and motif datasets and compare them with the results obtained by other algorithms.
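To make the variance-bound criterion concrete, the following is a minimal Python sketch. It is not the lattice-based search described above: it is a brute-force enumeration, exponential in the matrix size, meant only to illustrate what "a bicluster whose cell values stay within a variance bound" means. The function names, the toy score matrix, the minimum-size parameters, and the choice of population variance are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import pvariance


def bicluster_variance(scores, genes, tfs):
    """Population variance of the cells selected by gene rows and TF columns."""
    cells = [scores[g][t] for g in genes for t in tfs]
    return pvariance(cells)


def low_variance_biclusters(scores, max_variance, min_genes=2, min_tfs=2):
    """Enumerate every (genes, tfs) pair whose cell variance is within the bound.

    Brute force for illustration only; a practical method would prune the
    search space instead of visiting every subset.
    """
    n_genes, n_tfs = len(scores), len(scores[0])
    found = []
    for k in range(min_tfs, n_tfs + 1):
        for tfs in combinations(range(n_tfs), k):
            for m in range(min_genes, n_genes + 1):
                for genes in combinations(range(n_genes), m):
                    if bicluster_variance(scores, genes, tfs) <= max_variance:
                        found.append((genes, tfs))
    return found


# Hypothetical gene-by-TF regulation scores (rows: genes, columns: TFs).
scores = [
    [0.90, 0.88, 0.10],
    [0.91, 0.89, 0.70],
    [0.20, 0.95, 0.40],
]

result = low_variance_biclusters(scores, max_variance=0.001)
print(result)
```

In this toy matrix only genes 0 and 1 paired with TFs 0 and 1 form a block of nearly equal scores, so that is the only bicluster that satisfies the bound. Because all subsets are enumerated, overlapping biclusters would be reported as well whenever they exist.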
