Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis

Genetic sequence alignment is the basis of many evolutionary and comparative studies, and errors in alignments lead to errors in the interpretation of evolutionary information in genomes. Traditional multiple sequence alignment methods disregard the phylogenetic implications of gap patterns that they create and infer systematically biased alignments with excess deletions and substitutions, too few insertions, and implausible insertion-deletion–event histories. We present a method that prevents these systematic errors by recognizing insertions and deletions as distinct evolutionary events. We show theoretically and practically that this improves the quality of sequence alignments and downstream analyses over a wide range of realistic alignment problems. These results suggest that insertions and sequence turnover are more common than is currently thought and challenge the conventional picture of sequence evolution and mechanisms of functional and structural changes.
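The claim that treating every gap as a deletion inflates deletion counts can be illustrated with a toy event count for one alignment column. The sketch below is not the method presented here. It is a minimal parsimony illustration that assumes a rooted tree written as nested tuples, a character that cannot be regained once lost, and hypothetical helper names (`leaves`, `deletions_given_present`, `mrca`, `explain_column`).

```python
# Toy illustration only: counts events needed to explain one alignment column
# on a rooted tree, under the assumption that a lost character is never regained.
# The tree encoding and all function names are hypothetical.

def leaves(node):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def deletions_given_present(node, present):
    """Minimum deletions below `node`, assuming the character exists at `node`."""
    if isinstance(node, str):
        return 0
    total = 0
    for child in node:
        if not (leaves(child) & present):
            total += 1  # one deletion on the branch removes it from the whole clade
        else:
            total += deletions_given_present(child, present)
    return total

def mrca(node, present):
    """Smallest subtree containing every leaf that carries the character."""
    if isinstance(node, str):
        return node
    for child in node:
        if present <= leaves(child):
            return mrca(child, present)
    return node

def explain_column(tree, present):
    """Compare a deletion-only history with a single-insertion history."""
    if not present:
        raise ValueError("at least one sequence must carry the character")
    clade = mrca(tree, present)
    return {
        "deletion_only": deletions_given_present(tree, present),
        "with_insertion": (1, deletions_given_present(clade, present)),
    }

tree = ((("A", "B"), "C"), "D")
print(explain_column(tree, {"A", "B"}))
```

For a character found only in sequences A and B, the deletion-only history needs two independent deletions (on the branches to C and D), whereas a single insertion in the ancestor of A and B explains the same column with one event. A method that cannot tell the two histories apart will tend to report the former.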
