Combinatorial motif analysis and hypothesis generation on a genomic scale

MOTIVATION: Computer-assisted methods are essential for the analysis of biosequences. Gene activity is regulated in part by the binding of regulatory molecules (transcription factors) to combinations of short motifs. The goal of our analysis is to develop algorithms that identify regulatory motifs and predict the activity of combinations of those motifs.

APPROACH: Our research begins with a new motif-finding method that uses multiple objective functions and an improved stochastic iterative sampling strategy. Combinatorial motif analysis is accomplished by constructive induction, which analyzes potential motif combinations. Hypotheses are then generated by applying standard inductive learning algorithms.

RESULTS: Tests on 10 previously identified regulons from budding yeast and 14 artificial families of sequences demonstrated the effectiveness of the new motif-finding method. Motif combination and classification approaches were then applied to a sample DNA array data set derived from genome-wide gene expression analysis.

AVAILABILITY: Programs will be available as executable files upon request.

CONTACT: yhu@ics.uci.edu or yhu@cse.ttu.edu.tw
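
The combinatorial step described in the approach can be illustrated with a minimal sketch. It is not the authors' program: it assumes motifs are exact strings (the paper's motif finder is stochastic and its motif representation is not given here), and the function names, motif strings, and sequences are hypothetical placeholders chosen for illustration. The sketch constructs Boolean attributes for each motif and for each motif pair, which is the kind of constructed feature a standard inductive learner could then consume.

```python
from itertools import combinations


def motif_features(sequence, motifs):
    """Map each motif to True if it occurs as an exact substring of the sequence."""
    return {m: (m in sequence) for m in motifs}


def combination_features(sequence, motifs, size=2):
    """Add constructed attributes: conjunctions of motif presence.

    Each combination of `size` motifs becomes one Boolean attribute that is
    True only when every motif in the combination occurs in the sequence.
    """
    present = motif_features(sequence, motifs)
    feats = dict(present)
    for combo in combinations(sorted(motifs), size):
        feats["&".join(combo)] = all(present[m] for m in combo)
    return feats


# Hypothetical motifs and upstream sequences, for illustration only.
motifs = ["TGACTC", "CCCCT"]
seq_both = "AATGACTCGGCCCCTAA"   # contains both motifs
seq_one = "AATGACTCGG"           # contains only the first motif

feats_both = combination_features(seq_both, motifs)
feats_one = combination_features(seq_one, motifs)
```

Each resulting dictionary is one row of a feature table; pairing such rows with expression labels would give a training set for a decision-tree or rule learner. Exact substring matching is used only to keep the example short.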
