MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons

Until now, the most efficient way to align nucleotide sequences containing open reading frames has been to use indirect procedures that align the amino acid translations and then report the inferred gap positions at the codon level. This approach has two important pitfalls. First, any premature stop codon prevents its use. Second, each sequence is translated in the same reading frame from beginning to end, so a single additional nucleotide leads to both an aberrant translation and an aberrant alignment. We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. This pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution for aligning protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful for detecting undocumented frameshifts in public database sequences and for aligning next-generation sequencing reads/contigs against a reference coding sequence. MACSE is distributed as an open-source Java executable with freely available source code and can be used via a web interface at http://mbb.univ-montp2.fr/macse.
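
The two pitfalls of translation-based alignment can be illustrated with a minimal Python sketch. It is not part of MACSE and does not implement its algorithm; the toy sequences, the `translate` helper and the use of the standard genetic code are assumptions made only for illustration. It shows how a single inserted nucleotide both shifts the downstream translation and introduces a premature stop codon when one reading frame is used from beginning to end.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative genetic code is needed for this toy example).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate seq in a single fixed reading frame; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for codons with unknown bases
        for i in range(frame, len(seq) - 2, 3)
    )


# Hypothetical in-frame coding fragment.
reference = "ATGGCCAAGCTGGAATGC"

# Same fragment with one extra nucleotide inserted after the second codon.
shifted = reference[:6] + "T" + reference[6:]

print(translate(reference))    # MAKLEC
print(translate(shifted))      # MA*AGM: premature stop, then frameshifted residues
print(translate(shifted[7:]))  # KLEC: the original residues reappear once the frame is restored
```

The last call restores the frame by hand, which is possible here only because the position of the insertion is known; an alignment method that accounts for frameshifts has to infer such positions from the sequences themselves.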
