NGlyAlign: an automated library building tool to align highly divergent HIV envelope sequences

Background: The high variability in the envelope regions of some viruses, such as HIV, allows the virus to establish infection and to escape subsequent immune surveillance. This variability, together with the increasing incorporation of N-linked glycosylation sites, is fundamental to this evasion. It also creates difficulties for multiple sequence alignment (MSA) methods, which provide the first step in the analysis of these sequences. Existing MSA tools often fail to properly align highly variable HIV envelope sequences, requiring extensive manual editing that is impractical with even a moderate number of such sequences.

Results: We developed an automated library building tool, NGlyAlign, that organizes similar N-linked glycosylation sites as block constraints and statistically conserved global sites as single-site constraints to automatically enforce partial columns in consistency-based MSA methods such as Dialign. This combined method accurately aligns variable HIV-1 envelope sequences. We tested the method on two datasets: a set of 156 founder and chronic gp160 HIV-1 subtype B sequences, and a set of gp120 reference sequences in the highly variable region 1. On measures such as entropy scores, sum-of-pairs scores, column scores, and similarity heat maps, NGlyAlign+Dialign proved superior to methods such as T-Coffee, ClustalOmega, ClustalW, Praline, HIValign and Muscle. The method is scalable to large sequence sets, producing accurate alignments without requiring manual editing. In addition to this application to HIV, our method can be used for other highly variable glycoproteins such as the hepatitis C virus envelope.

Conclusions: NGlyAlign is an automated tool for mapping and building glycosylation motif libraries to accurately align highly variable regions in HIV sequences. It can provide the basis for many studies that rely on a single robust alignment.
NGlyAlign has been developed as an open-source tool and is freely available at https://github.com/UNSW-Mathematical-Biology/NGlyAlign_v1.0 .
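
To make two of the ideas in the abstract concrete, the following is a minimal Python sketch, not code from NGlyAlign itself. It assumes the common textbook definition of an N-linked glycosylation sequon (Asn, then any residue except Pro, then Ser or Thr) and the standard Shannon entropy of an alignment column; the function names `find_sequons` and `column_entropy` are hypothetical and chosen only for illustration. How NGlyAlign actually groups sites into block constraints and computes its entropy scores is not specified here.

```python
import math
import re
from collections import Counter

# Assumed sequon definition: Asn, any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping sequons (e.g. "NNTS") be reported separately.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(seq):
    """Return 0-based start positions of N-X-[S/T] sequons in an ungapped sequence."""
    return [m.start() for m in SEQUON.finditer(seq.upper())]


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring '-' gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Toy sequence (not real HIV data): "NNT" and "NTS" match, "NPS" does not.
    print(find_sequons("ANNTSWNPSA"))
    # A fully conserved column versus an evenly split column.
    print(column_entropy("AAAA"), column_entropy("AACC"))
```

A lower column entropy indicates a more conserved column, which is why entropy is a natural measure for comparing how well different alignments line up conserved sites such as glycosylation motifs.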
