A system for pattern matching applications on biosequences

ANREP is a system for finding matches to patterns composed of (i) spacing constraints called 'spacers', and (ii) approximate matches to 'motifs' that are, recursively, patterns composed of 'atomic' symbols. A user specifies such patterns via a declarative, free-format and strongly typed language called A that is presented here in a tutorial style through a series of progressively more complex examples. The sample patterns are for protein and DNA sequences, the application domain for which ANREP was specifically created. ANREP provides a unified framework for almost all previously proposed biosequence patterns and extends them by providing approximate matching, a feature heretofore unavailable except for the limited case of individual sequences. The performance of ANREP is discussed and an appendix gives a concise specification of syntax and semantics. A portable C software package implementing ANREP is available via anonymous remote file transfer.
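As a rough illustration of the kind of pattern described above, the sketch below searches for two motifs separated by a spacer, with each motif allowed a bounded number of substitutions. It is not written in the A language and is not ANREP's algorithm: the function names `motif_hits` and `network_hits`, the substitution-only (Hamming) error model, and the promoter-like example pattern are all assumptions made for this example.

```python
from typing import List, Tuple


def motif_hits(seq: str, motif: str, max_mismatches: int) -> List[int]:
    """Return start positions where `motif` matches `seq` with at most
    `max_mismatches` substitutions (hypothetical helper; Hamming distance only)."""
    hits = []
    width = len(motif)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def network_hits(
    seq: str,
    motif1: str,
    motif2: str,
    spacer: Tuple[int, int],
    max_mismatches: int,
) -> List[Tuple[int, int]]:
    """Return (start1, start2) pairs where motif1 is followed by motif2 and the
    number of symbols between them lies in the inclusive range `spacer`."""
    lo, hi = spacer
    starts2 = set(motif_hits(seq, motif2, max_mismatches))
    results = []
    for start1 in motif_hits(seq, motif1, max_mismatches):
        end1 = start1 + len(motif1)
        for gap in range(lo, hi + 1):
            if end1 + gap in starts2:
                results.append((start1, end1 + gap))
    return results


# Illustrative DNA input: an exact TTGACA, a 17-base spacer, and TATGAT,
# which differs from TATAAT by one substitution.
example_seq = "GG" + "TTGACA" + "A" * 17 + "TATGAT" + "CC"
example_hits = network_hits(example_seq, "TTGACA", "TATAAT", (15, 21), 1)
```

Here `example_hits` holds the start positions of the two motif occurrences. A fuller treatment would also allow insertions and deletions within motifs and would score matches rather than apply a simple threshold.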
