Comparative Sequence Analysis: Finding Genes

Publisher Summary

This chapter discusses approaches for detecting sequence similarities, with emphasis on their application to genome sequence analysis. The major goal of large-scale genome analysis is to identify and characterize as many of an organism's genes as possible. Finding genes in sequences produced by large-scale projects is typically harder than it is for the individual investigator, because little contextual information is available about any particular segment of DNA. The sequences are determined as DNA but searched as protein. The chapter also describes widely used algorithms for detecting sequence similarities and assessing their significance; consensus methods for searching; simple and complex patterns for representing a consensus; and the use of blocks to represent the most highly conserved regions between the gaps.
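The statement that sequences are determined as DNA but searched as protein can be illustrated with a six-frame conceptual translation, the usual preparation step before a protein-level search. The following is a minimal sketch, not the chapter's own code: it assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical names chosen for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six translated strings could then be compared against a protein database with whichever similarity search method is preferred; stop codons are kept as `*` so that open reading frames remain visible.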
