Detecting Motifs from Sequences

Yuh-Jyh Hu (1), Dennis Kibler, and Suzanne Sandmeyer (2)
(1) Information and Computer Science Department
(2) Biological Chemistry Department
University of California, Irvine
yhu@ics.uci.edu, kibler@ics.uci.edu, sbsandme@uci.edu

Abstract

The problem of multiple global comparison in families of biological sequences has been well studied. Fewer algorithms have been developed for identifying local consensus patterns or motifs in biological sequences. These two important problems have different biological constraints and, consequently, different computational approaches. The difficulty of finding the biologically meaningful motifs results from (1) the variation among motif bases, (2) the alignment of motif positions (sites) among the sequences, and (3) the multiplicity of motif occurrences within a given sequence. In this paper, we review and compare the main approaches for finding motifs. We also introduce our own approach, DMS, which combines two objective functions with an improved iterative sampling search method. We demonstrate the effectiveness of various algorithms by comparing them on 10 real domains and 14 artificial domains. The main advantage of DMS is that it is better able to find shorter motifs.

1 Introduction

Genome projects are generating large data sets of genomic sequence data. However, the size and speed of acquisition of these data sets exceed experimental analyses and interpretations. Among other genomes sequenced, yeast was completely sequenced in 1995. It has 12 million base pairs (bps) and about 6,000 genes. To the surprise of biologists, the biological functions of only about 2,000 genes were known. The functions of another 2,000 genes might be guessed at by comparison. The functions of the remaining 2,000 genes, called orphans, are unknown. Recently the complete genome (approximately 100 million bps) of a multi-celled animal (C. elegans) was determined. Within a few years the sequencing of the human genome (approximately 3 billion bps) is anticipated. Once the genome and genes have been determined there are two essential questions to be answered: 1) What is the function of each gene, and 2) When is the gene expressed?

The first question has been heavily studied and primarily depends on characterizing a gene family. The most successful way of characterizing a gene has been based on probabilistic models, usually some instantiation of Hidden Markov Models (HMMs). HMMs work well for this problem since they provide a global model which allows insertions, deletions, and transpositions. These capabilities match the intuition that similar genes have had a common evolutionary history and that the evolution process involves insertions, deletions, and changes to the base pairs.

The second question has been less well studied and has a very different character. Biologists have determined that the control or regulation of gene expression in animals is primarily determined by relatively short sequences in the upstream or surrounding region of a gene. These sequences vary in length from about 5 to 12 bases, have a large amount of variability in their base constituency, do not have insertions or deletions, do not occur in the same position, and sometimes occur multiple times. These qualities prohibit the simple application of HMMs.

Several methods have been developed for detecting patterns shared by functionally related biosequences (Helden et al., 1998; Hertz & Stormo, 1995; Hertz et al., 1990; Bailey & Elkan, 1995; Lawrence et al., 1993; Hughey and Krogh, 1996; Eddy, 1995). These methods employ different representations, objective func-
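
The motif properties listed in the introduction (short length, variation among bases, no insertions or deletions, no fixed position, and possible multiple occurrences per sequence) can be illustrated with a small sketch. It is not DMS or any of the cited methods: it assumes the consensus is already known, whereas the methods discussed here must discover it. The consensus string, the toy sequences, and the names `hamming` and `find_sites` are all hypothetical and chosen only for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sites(sequence, consensus, max_mismatches):
    """Return (position, site, mismatches) for every ungapped window of the
    sequence that differs from the consensus in at most max_mismatches bases.

    Windows are ungapped because motif sites are assumed to have no
    insertions or deletions; every start position is tried because sites
    are not aligned across sequences.
    """
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = hamming(window, consensus)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Hypothetical toy data: a 7-base consensus and three short sequences.
consensus = "TGACTCA"
sequences = {
    "seq1": "GGCATGACTCATTTAC",
    "seq2": "TGACTCAGCCTGAGTCAAA",
    "seq3": "ACGTTTGACTAACG",
}

for name, seq in sequences.items():
    for start, site, mismatches in find_sites(seq, consensus, max_mismatches=1):
        print(name, start, site, mismatches)
```

In this toy data the site starts at a different offset in each sequence, seq2 contains two sites, and two of the four sites differ from the consensus by one base, which mirrors the three sources of difficulty named in the abstract.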

[1] Gary D. Stormo, et al. Identification of consensus patterns in unaligned DNA sequences known to be functionally related, 1990, Comput. Appl. Biosci.

[2] J. Collado-Vides, et al. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, 1998, Journal of Molecular Biology.

[3] G. Stormo. Computer methods for analyzing sequence recognition of nucleic acids, 1988, Annual Review of Biophysics and Biophysical Chemistry.

[4] Jun S. Liu, et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment, 1993, Science.

[5] Charles Elkan, et al. Unsupervised learning of multiple motifs in biopolymers using expectation maximization, 1995, Mach. Learn.

[6] Sean R. Eddy, et al. Multiple Alignment Using Hidden Markov Models, 1995, ISMB.

[7] Steven E. Hampson, et al. Large plateaus and plateau search in Boolean Satisfiability problems: When to give up searching and start again, 1993, Cliques, Coloring, and Satisfiability.

[8] Anders Krogh, et al. Hidden Markov models for sequence analysis: extension and analysis of the basic method, 1996, Comput. Appl. Biosci.

[9] R. Harr, et al. Search algorithm for pattern match analysis of nucleic acid sequences, 1983, Nucleic Acids Res.

[10] G. Stormo, et al. Identification of consensus patterns in unaligned DNA and protein sequences: a large-deviation stati, 1995.

[11] A. A. Reilly, et al. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, 1990, Proteins.

[12] R. Staden. Computer methods to locate signals in nucleic acid sequences, 1984, Nucleic Acids Res.