HIV-1 coreceptor usage prediction without multiple alignments: an application of string kernels

Background: Human immunodeficiency virus type 1 (HIV-1) infects cells by means of ligand-receptor interactions. This lentivirus uses the CD4 receptor in conjunction with a chemokine coreceptor, either CXCR4 or CCR5, to enter a target cell. HIV-1 is characterized by high sequence variability. Nonetheless, within this extensive variability, certain features must be conserved to define functions and phenotypes. Determining the coreceptor usage of HIV-1 from its envelope protein sequence is an instance of a well-studied machine learning problem known as classification. The support vector machine (SVM), with string kernels, has proven to be very efficient for dealing with a wide class of classification problems ranging from text categorization to protein homology detection. In this paper, we investigate how the SVM can predict HIV-1 coreceptor usage when it is equipped with an appropriate string kernel.

Results: Three string kernels were compared. Accuracies of 96.35% (CCR5), 94.80% (CXCR4), and 95.15% (CCR5 and CXCR4) were achieved with the SVM equipped with the distant segments kernel on a test set of 1425 examples, with a classifier built on a training set of 1425 examples. Our datasets are built from sequences in the Los Alamos National Laboratory HIV Databases. A web server is available at http://genome.ulaval.ca/hiv-dskernel.

Conclusion: We examined string kernels that have been used successfully for protein homology detection and propose a new one that we call the distant segments kernel. We also show how to extract the most relevant features for HIV-1 coreceptor usage. The SVM with the distant segments kernel is currently the best method described.
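To make the idea of a string kernel concrete, the following is a minimal sketch of a k-mer (spectrum-style) kernel, which compares two sequences by their shared substrings and therefore needs no multiple alignment. It is an illustration only and is not the distant segments kernel proposed here, whose definition is not given in this abstract. The function names, the choice k = 3, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of s and t.

    The two sequences may have different lengths; no alignment is used.
    """
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(n * ct[kmer] for kmer, n in cs.items())


def normalized_kernel(s: str, t: str, k: int = 3) -> float:
    """Cosine-normalized kernel value, so that K(s, s) = 1 when s has a k-mer."""
    denom = sqrt(spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))
    return spectrum_kernel(s, t, k) / denom if denom else 0.0


def gram_matrix(seqs: list[str], k: int = 3) -> list[list[float]]:
    """Pairwise kernel matrix over a list of sequences."""
    return [[normalized_kernel(a, b, k) for b in seqs] for a in seqs]


# Toy strings (hypothetical, not real HIV-1 envelope sequences).
toy_sequences = ["ACDEACDE", "ACDEAC", "WWYYWW"]
toy_gram = gram_matrix(toy_sequences)
```

A Gram matrix of this kind could then be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`, together with the coreceptor labels of the training sequences.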
