Improved feature-based prediction of SNPs in human cytochrome P450 enzymes

Single nucleotide polymorphisms (SNPs) are the most common form of mutation in the human cytochrome P450 enzyme family, and can lead to differing drug responses or specific diseases in individual patients. Here, we use machine learning to explore an effective set of sequence-based features for improving the prediction of SNPs with support vector machines. The features are derived from the target residues and their flanking protein sequences, and include amino acid type, sequence composition, physicochemical properties, position-specific scoring matrix, phylogenetic entropy and the number of possible codons of the target residue. To deal with the imbalanced data, in which non-SNPs form the majority and SNPs the minority, a preprocessing strategy based on fuzzy set theory was applied to the datasets. Our final model achieves a sensitivity of 93.8%, a specificity of 88.8%, an accuracy of 91.3% and an AUC of 0.971, which is significantly higher than previous DNA sequence-based or protein sequence-based methods. Furthermore, our study suggests the roles of individual features in the prediction of SNPs. The most important features are the amino acid type, the number of available codons, the position-specific scoring matrix and phylogenetic entropy. The improved model is a promising tool for SNP prediction and may assist research on genome mutation and personalized prescriptions.
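To make three of the listed features concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard genetic code for the codon count, Shannon entropy (base 2) over one column of a multiple sequence alignment as a stand-in for phylogenetic entropy, and a fixed-width window padded with `X` for the flanking sequence. The function names, the padding character and the window width are illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import Counter

# Number of codons encoding each amino acid in the standard genetic code.
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def flanking_window(seq, pos, flank, pad="X"):
    """Return the target residue at `pos` with `flank` residues on each side.

    Positions beyond the sequence ends are filled with `pad`
    (a hypothetical padding choice, not taken from the paper).
    """
    padded = pad * flank + seq + pad * flank
    return padded[pos:pos + 2 * flank + 1]


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def residue_features(seq, pos, column, flank=2):
    """Collect a few illustrative features for the residue at `pos`."""
    return {
        "amino_acid": seq[pos],
        "codon_count": CODON_COUNT[seq[pos]],
        "window": flanking_window(seq, pos, flank),
        "entropy": column_entropy(column),
    }
```

For example, `residue_features("MKVLA", 2, "VVIL")` describes the valine at index 2 of a toy sequence, using a toy alignment column of four residues. In a real pipeline such values would be numerically encoded and combined with the other features before being passed to a classifier.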
