Genome Analysis: Pattern Search in Biological Macromolecules

Biological sequence data analysis has become an indispensable tool for macromolecular biology and is key to any detailed understanding of the living cell. A brief survey of the biological macromolecules and their functions is given. Sequence data analysis is introduced as a basic tool for the experimental bench biologist. So far, most queries for such analyses have been issued against flat files and static indices. We discuss position tree structures and their potential in sequence data analysis. The hash position tree is introduced as a persistent, dynamic data structure for pattern searches in large biological sequence databases.
