Ensembled support vector machines for human papillomavirus risk type prediction from protein secondary structures

Infection by the human papillomavirus (HPV) is regarded as the major risk factor in the development of cervical cancer. Detection of high-risk HPV is important for understanding its oncogenic mechanisms and for developing novel clinical tools for its diagnosis, treatment, and prevention. Several methods are available to predict the risk types of HPV from protein sequences. Nevertheless, no existing tool achieves universally good performance across all domains, including HPV, nor do such tools provide confidence levels for their decisions. Here, we describe ensembled support vector machines (SVMs) for classifying HPV risk types, which assign a given protein to the high-, possibly high-, or low-risk type according to the confidence level of the prediction. Our approach uses protein secondary structures to capture the differential contribution of subsequences to the risk type, and the SVM classifiers are combined with a simple but efficient string kernel to handle HPV protein sequences. In the experiments, we compare our approach with previous methods in terms of accuracy and F1-score, and we present predictions for unknown HPV types, which yield promising results.
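As a rough illustration of the two ingredients named above, the following Python sketch shows a k-mer spectrum kernel (one common string kernel, assumed here because the abstract does not specify which kernel is used) and a vote-based rule that turns the outputs of an ensemble into one of three risk labels. The function names, the value of k, and the cut-offs 0.8 and 0.2 are hypothetical choices for illustration, not values from the paper, and the sketch omits the secondary-structure weighting and the SVM training itself.

```python
from collections import Counter


def spectrum(seq, k=3):
    """Count every length-k substring (k-mer) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    sa, sb = spectrum(a, k), spectrum(b, k)
    # A Counter returns 0 for k-mers it has not seen.
    return sum(count * sb[kmer] for kmer, count in sa.items())


def risk_level(votes, high_cut=0.8, low_cut=0.2):
    """Map ensemble votes (1 = high risk, 0 = low risk) to a three-way label.

    The cut-offs are hypothetical, not taken from the paper.
    """
    frac = sum(votes) / len(votes)
    if frac >= high_cut:
        return "high"
    if frac <= low_cut:
        return "low"
    return "possibly high"
```

In such a setup, each ensemble member would be an SVM trained with a precomputed kernel matrix built from `spectrum_kernel`, and the fraction of members voting high-risk would serve as a crude confidence level.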
