A novel missense-mutation-related feature extraction scheme for 'driver' mutation identification

MOTIVATION: It has become widely accepted that human cancer is a disease involving dynamic changes in the genome and that missense mutations constitute the bulk of human genetic variation. A multitude of computational algorithms, especially machine learning-based ones, have consequently been proposed to distinguish missense changes that contribute to cancer progression ('driver' mutations) from those that do not ('passenger' mutations). However, the existing methods have shortcomings: they either adopt an incomplete feature space or depend on protein structural databases, which are usually far from comprehensive.

RESULTS: In this article, we investigated multiple aspects of a missense mutation and identified a novel feature space that distinguishes cancer-associated driver mutations from passenger ones well. An index (DX score) was proposed to evaluate the discriminating capability of each feature, and the top-ranked subset of these features was selected to build the SVM classifier. Cross-validation showed that the classifier trained on our selected features significantly outperforms the existing ones in both precision and robustness. We applied our method to several datasets of missense mutations culled from published databases and the literature and obtained more reasonable results than those of previous studies.

AVAILABILITY: The software is available online at http://www.methodisthealth.com/software and https://sites.google.com/site/drivermutationidentification/.

CONTACT: xzhou@tmhs.org.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
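The feature-selection step described above (score each feature, keep the top-ranked subset, then train a classifier on it) can be sketched as follows. This is a minimal illustration only: the abstract does not define the DX score, so the sketch substitutes a simple stand-in (gap between class means divided by the summed class spreads), and the feature names and toy values are hypothetical. The selected columns would then be passed to an SVM, which is not shown here.

```python
from statistics import mean, pstdev


def separation_score(driver_vals, passenger_vals):
    """Stand-in for a discriminating index (NOT the paper's DX score):
    absolute gap between class means over the summed class spreads."""
    gap = abs(mean(driver_vals) - mean(passenger_vals))
    spread = pstdev(driver_vals) + pstdev(passenger_vals)
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_features(rows, labels, names):
    """Return feature names sorted from most to least discriminating.
    Label 1 marks a driver mutation, label 0 a passenger mutation."""
    scores = {}
    for j, name in enumerate(names):
        drivers = [r[j] for r, y in zip(rows, labels) if y == 1]
        passengers = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores[name] = separation_score(drivers, passengers)
    return sorted(names, key=lambda n: scores[n], reverse=True)


def select_top(rows, names, ranked, k):
    """Keep only the columns of the k top-ranked features."""
    keep = [names.index(n) for n in ranked[:k]]
    return [[r[j] for j in keep] for r in rows]


# Hypothetical toy data: one row per mutation, one column per feature.
names = ["conservation", "hydrophobicity_change", "noise"]
rows = [
    [0.9, 2.0, 0.5],
    [0.8, 1.5, 0.25],
    [0.2, 1.8, 0.25],
    [0.1, 1.2, 0.5],
]
labels = [1, 1, 0, 0]

ranked = rank_features(rows, labels, names)
top2 = select_top(rows, names, ranked, 2)
```

In this toy example the uninformative `noise` column has identical class means and is ranked last, so it is dropped when only the two best features are kept.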
