Frameshift alignment: statistics and post-genomic applications

MOTIVATION: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score.

RESULTS: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
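
For readers unfamiliar with the "classic BLAST statistics" mentioned above, the following is a minimal sketch of the standard Gumbel (Karlin-Altschul) form, in which the expected number of chance alignments with score at least S in a search space of size m × n is E = K m n exp(-λS). It is not the estimation method of this paper: the function names are hypothetical, the values of λ and K are placeholders chosen only for illustration, and finite-size (edge-effect) corrections are ignored.

```python
import math


def evalue(score, m, n, lam, K):
    """Expected number of chance alignments scoring >= score.

    m, n : sizes of the two sequences (or query and database) being compared
    lam, K : Gumbel scale parameter and pre-factor (placeholders here; in
             practice they depend on the scoring scheme and must be estimated)
    """
    return K * m * n * math.exp(-lam * score)


def pvalue(score, m, n, lam, K):
    """Probability of at least one chance alignment scoring >= score,
    using the Poisson approximation P = 1 - exp(-E)."""
    return -math.expm1(-evalue(score, m, n, lam, K))


def bit_score(score, lam, K):
    """Normalized score in bits, so that E = m * n * 2**(-bits)."""
    return (lam * score - math.log(K)) / math.log(2)


if __name__ == "__main__":
    # Placeholder parameters, NOT taken from the paper.
    lam, K = 0.3, 0.1
    m, n = 300, 1_000_000
    for s in (40, 60, 80):
        print(s, evalue(s, m, n, lam, K), pvalue(s, m, n, lam, K))
```

The only inputs that change between scoring schemes are λ and K, which is why a significance method for frameshift alignment largely comes down to estimating those two parameters for scoring schemes that include frameshift penalties.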
