Identification of Regulatory Modules in Time Series Gene Expression Data Using a Linear Time Biclustering Algorithm

Although most biclustering formulations are NP-hard, in the analysis of time series expression data it is reasonable to restrict the problem to the identification of maximal biclusters with contiguous columns, which correspond to coherent expression patterns shared by a group of genes at consecutive time points. This restriction leads to a tractable problem. We propose an algorithm, CCC-Biclustering, that finds and reports all maximal contiguous column coherent biclusters (CCC-biclusters) in time linear in the size of the expression matrix. The linear time complexity of CCC-Biclustering relies on the use of a discretized matrix and efficient string processing techniques based on suffix trees. We also propose a method for ranking biclusters by their statistical significance and a methodology for filtering highly overlapping, and therefore redundant, biclusters. We report results on synthetic and real data that show the effectiveness of the approach and its relevance to the discovery of regulatory modules. Results obtained using the transcriptomic expression patterns of Saccharomyces cerevisiae in response to heat stress show that the proposed methodology extracts relevant information compatible with documented biological knowledge, and they indicate the utility of the algorithm in the study of other environmental stresses and of regulatory modules in general.
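To make the notion of a maximal bicluster with contiguous columns concrete, the following is a minimal sketch in Python. It is not the suffix-tree algorithm described above and does not run in linear time: it enumerates every column span by brute force. The three-symbol alphabet ('D', 'N', 'U' for decrease, no change, increase between consecutive time points), the `threshold` parameter, the `min_rows` filter, and both function names are assumptions introduced only for illustration.

```python
from collections import defaultdict


def discretize(matrix, threshold=1.0):
    """Turn each row of expression values into a string over {'D', 'N', 'U'}.

    Assumed scheme: a symbol encodes the change between consecutive
    time points, so a row with k values yields k - 1 symbols.
    """
    rows = []
    for values in matrix:
        symbols = []
        for prev, curr in zip(values, values[1:]):
            diff = curr - prev
            if diff >= threshold:
                symbols.append("U")
            elif diff <= -threshold:
                symbols.append("D")
            else:
                symbols.append("N")
        rows.append("".join(symbols))
    return rows


def maximal_ccc_biclusters(rows, min_rows=2):
    """Brute-force search for maximal contiguous-column coherent biclusters.

    `rows` are equal-length strings. Returns tuples
    (row_indices, first_col, last_col, pattern) with inclusive columns.
    """
    n_cols = len(rows[0]) if rows else 0
    found = []
    for start in range(n_cols):
        for end in range(start + 1, n_cols + 1):
            # Grouping by substring puts every row sharing the pattern
            # in one group, so each group is maximal in rows.
            groups = defaultdict(list)
            for i, row in enumerate(rows):
                groups[row[start:end]].append(i)
            for pattern, members in groups.items():
                if len(members) < min_rows:
                    continue
                # Not maximal if all members agree on the column to the left.
                if start > 0 and len({rows[i][start - 1] for i in members}) == 1:
                    continue
                # Not maximal if all members agree on the column to the right.
                if end < n_cols and len({rows[i][end] for i in members}) == 1:
                    continue
                found.append((members, start, end - 1, pattern))
    return found


if __name__ == "__main__":
    expression = [[0, 2, 0, 0], [1, 3, 1, 3], [1, 1, -1, -1]]
    for bicluster in maximal_ccc_biclusters(discretize(expression)):
        print(bicluster)
```

The sketch rebuilds the row groups for every column span, which is what makes it much slower than linear. A suffix-tree approach such as the one the abstract refers to avoids this repeated work by indexing the shared substrings once.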
