The whole alignment and nothing but the alignment: the problem of spurious alignment flanks

Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs or amino acids in length. For example, spurious flanks probably comprise >18% of the human–fugu genome alignment in the UCSC genome database. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple ‘overalignment’ P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.
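The overextension effect can be illustrated with a toy simulation. The sketch below is not the article's method and does not compute the overalignment P-value; it only models a gapless extension past the true end of an alignment as a random walk over unrelated DNA columns and reports how many spurious columns the best-scoring extension would include. The function names, the match/mismatch scores, the 0.25 chance-match probability and the 500-column cap are all assumptions chosen for illustration.

```python
import random


def spurious_flank_length(match, mismatch, p_match=0.25, max_len=500, rng=None):
    """Length of the best-scoring gapless extension into unrelated sequence.

    Each extra column is a chance match with probability p_match (assumed
    0.25 for uniform DNA). The extension that maximizes the running score
    is the flank a score-maximizing aligner would keep.
    """
    rng = rng or random.Random()
    score = best_score = best_len = 0
    for length in range(1, max_len + 1):
        score += match if rng.random() < p_match else mismatch
        if score > best_score:
            best_score, best_len = score, length
    return best_len


def mean_flank_length(match, mismatch, trials=300, seed=0):
    """Average spurious flank length over repeated random extensions."""
    rng = random.Random(seed)
    total = sum(
        spurious_flank_length(match, mismatch, rng=rng) for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # Hypothetical scoring schemes: (match score, mismatch score).
    for match, mismatch in [(1, -1), (2, -1)]:
        mean_len = mean_flank_length(match, mismatch)
        print(f"match={match:+d} mismatch={mismatch:+d}: "
              f"mean spurious flank = {mean_len:.1f} columns")
```

In this toy model, the scheme whose expected score per random column is closer to zero (here +2/-1) yields longer spurious flanks than the stricter +1/-1 scheme, which mirrors the tradeoff between over- and under-alignment described above. Gapped alignments, which the article addresses, are not captured by this gapless sketch.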
