A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series

BackgroundThe ability to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses using gene expression time series, obtained from microarray experiments, is critical to advance our understanding of complex biological processes. In this context, biclustering algorithms have been recognized as an important tool for the discovery of local expression patterns, which are crucial to unravel potential regulatory mechanisms. Although most formulations of the biclustering problem are NP-hard, when working with time series expression data the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the design of efficient biclustering algorithms able to identify all maximal contiguous column coherent biclusters.MethodsIn this work, we propose e-CCC-Biclustering, a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series gene expression matrix. This polynomial time complexity is achieved by manipulating a discretized version of the original matrix using efficient string processing techniques. We also propose extensions to deal with missing values, discover anticorrelated and scaled expression patterns, and different ways to compute the errors allowed in the expression patterns. We propose a scoring criterion combining the statistical significance of expression patterns with a similarity measure between overlapping biclusters.ResultsWe present results in real data showing the effectiveness of e-CCC-Biclustering and its relevance in the discovery of regulatory modules describing the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress. In particular, the results show the advantage of considering approximate patterns when compared to state of the art methods that require exact matching of gene expression time series.DiscussionThe identification of co-regulated genes, involved in specific biological processes, remains one of the main avenues open to researchers studying gene regulatory networks. The ability of the proposed methodology to efficiently identify sets of genes with similar expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific regulatory mechanisms.AvailabilityA prototype implementation of the algorithm coded in Java together with the dataset and examples used in the paper is available in http://kdbio.inesc-id.pt/software/e-ccc-biclustering.

[1]  Ziv Bar-Joseph,et al.  Analyzing time series gene expression data , 2004, Bioinform..

[2]  Arlindo L. Oliveira,et al.  Biclustering algorithms for biological data analysis: a survey , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[3]  Kian-Lee Tan,et al.  Mining gene expression data for positive and negative co-regulated gene clusters , 2004, Bioinform..

[4]  David Martin,et al.  GOToolBox: functional analysis of gene datasets based on Gene Ontology , 2004, Genome Biology.

[5]  Arlindo L. Oliveira,et al.  An Efficient Biclustering Algorithm for Finding Genes with Similar Patterns in Time-series Expression Data , 2007, APBC.

[6]  Lothar Thiele,et al.  A systematic comparison and evaluation of biclustering methods for gene expression data , 2006, Bioinform..

[7]  Jinze Liu,et al.  Biclustering in gene expression data by tendency , 2004 .

[8]  Weiqi Wang,et al.  Gene ontology friendly biclustering of expression profiles , 2004 .

[9]  Richard M. Karp,et al.  Discovering local structure in gene expression data: the order-preserving submatrix problem , 2002, RECOMB '02.

[10]  Jiong Yang,et al.  A framework for ontology-driven subspace clustering , 2004, KDD.

[11]  Bart De Moor,et al.  Biclustering microarray data by Gibbs sampling , 2003, ECCB.

[12]  Marie-France Sagot,et al.  Spelling Approximate Repeated or Common Motifs Using a Suffix Tree , 1998, LATIN.

[13]  Roded Sharan,et al.  Discovering statistically significant biclusters in gene expression data , 2002, ISMB.

[14]  Ya Zhang,et al.  A time-series biclustering algorithm for revealing co-regulated genes , 2005, International Conference on Information Technology: Coding and Computing (ITCC'05) - Volume II.

[15]  Pooja Jain,et al.  The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae , 2005, Nucleic Acids Res..

[16]  Jiong Yang,et al.  Mining Sequential Patterns from Large Data Sets , 2005, Advances in Database Systems.

[17]  Wojciech Szpankowski,et al.  Finding biclusters by random projections , 2006, Theor. Comput. Sci..

[18]  Kian-Lee Tan,et al.  Identifying time-lagged gene clusters using gene expression data , 2005, Bioinform..

[19]  Arlindo L. Oliveira,et al.  Efficient Biclustering Algorithms for Time Series Gene Expression Data Analysis , 2009, IWANN.

[20]  Eckart Zitzler,et al.  BicAT: a biclustering analysis toolbox , 2006, Bioinform..

[21]  Ioannis P. Androulakis,et al.  A novel non-overlapping bi-clustering algorithm for network generation using living cell array data , 2007, Bioinform..

[22]  D. Botstein,et al.  Genomic expression programs in the response of yeast cells to environmental changes. , 2000, Molecular biology of the cell.

[23]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[24]  Wojciech Szpankowski,et al.  Biclustering gene-feature matrices for statistically significant dense patterns , 2004 .

[25]  René Peeters,et al.  The maximum edge biclique problem is NP-complete , 2003, Discret. Appl. Math..

[26]  I. Androulakis,et al.  Analysis of time-series gene expression data: methods, challenges, and opportunities. , 2007, Annual review of biomedical engineering.

[27]  Hans-Hermann Bock,et al.  Two-mode clustering methods: astructuredoverview , 2004, Statistical methods in medical research.

[28]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[29]  Arlindo L. Oliveira,et al.  A Linear Time Biclustering Algorithm for Time Series Gene Expression Data , 2005, WABI.

[30]  George M. Church,et al.  Biclustering of Expression Data , 2000, ISMB.

[31]  Yutao Fu,et al.  Gene expression module discovery using gibbs sampling. , 2004, Genome informatics. International Conference on Genome Informatics.

[32]  Arlindo L. Oliveira,et al.  Identification of Regulatory Modules in Time Series Gene Expression Data Using a Linear Time Biclustering Algorithm , 2010, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[33]  T. M. Murali,et al.  Extracting Conserved Gene Expression Motifs from Gene Expression Data , 2002, Pacific Symposium on Biocomputing.