Large-Scale Sequence Comparison.

Millions of sequences have been deposited in genomic databases, and categorizing them according to their structural and functional roles is an important task. Sequence comparison is a prerequisite for categorizing both DNA and protein sequences, and it helps in assigning a putative or hypothetical structure and function to a given sequence. Of the various methods available for comparing sequences, alignment is the foremost, both for short sequences of a few base pairs and for large-scale genome comparison. Various tools are available for pairwise comparison of large sequences; the best-known ones either compute a global alignment or generate local alignments between the two sequences. In this chapter we first provide basic information on sequence comparison, followed by a description of the PAM and BLOSUM substitution matrices that form the basis of scoring in sequence comparison. We then give a practical overview of currently available methods such as BLAST and FASTA, and of tools for genome comparison, including LAGAN, MUMmer, BLASTZ, and AVID.
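The distinction between global and local alignment can be illustrated with a small dynamic-programming sketch. This is not taken from the chapter: the function names `global_score` and `local_score` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration. Real tools score proteins with matrices such as PAM or BLOSUM and typically use affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    # Row 0: aligning an empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # Column 0: a[:i] against an empty prefix of `b`.
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Flooring at 0 lets an alignment restart anywhere.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # The sequences share only the short region "CGT".
    print(global_score("AAAACGT", "CGTCCCC"))
    print(local_score("AAAACGT", "CGTCCCC"))
```

With these assumed scores, the local variant rewards the shared region on its own, while the global variant is penalized for every unmatched flanking base. This is one reason local methods are often preferred when only part of two sequences is related.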
