Identification of Sequence Patterns, Motifs and Domains

The evolutionary process is constrained by the structure and function of biological molecules. These constraints appear in macromolecular sequences as conserved regions. Because sequence conservation is tightly coupled to molecular function, considerable effort has gone into developing methods for describing sequence patterns, motifs, and domains, and into developing efficient methods for identifying novel examples of known motifs. In computational biology, the terms pattern, motif, and domain are used somewhat interchangeably, and the choice depends largely on the size of the molecular pattern under discussion: shorter patterns are often referred to as motifs, signatures, or patterns, while longer patterns are more likely to be referred to as domains. Current approaches have their roots in sequence alignment, but more recent methods also employ many techniques from machine learning.
