In silico prediction of novel therapeutic targets using gene–disease association data

Background: Target identification and validation is a pressing challenge in the pharmaceutical industry, with many of the programmes that fail for efficacy reasons showing poor association between the drug target and the disease. Computational prediction of successful targets could have a considerable impact on attrition rates in the drug discovery pipeline by significantly reducing the initial search space. Here, we explore whether gene–disease association data from the Open Targets platform is sufficient to predict therapeutic targets that are actively being pursued by pharmaceutical companies or are already on the market.

Methods: To test our hypothesis, we train four different classifiers (a random forest, a support vector machine, a neural network and a gradient boosting machine) on partially labelled data and evaluate their performance using nested cross-validation and testing on an independent set. We then select the best performing model and use it to make predictions on more than 15,000 genes. Finally, we validate our predictions by mining the scientific literature for proposed therapeutic targets.

Results: We observe that the data types with the best predictive power are animal models showing a disease-relevant phenotype, differential expression in diseased tissue and genetic association with the disease under investigation. On a test set, the neural network classifier achieves over 71% accuracy with an AUC of 0.76 when predicting therapeutic targets in a semi-supervised learning setting. We use this model to gain insights into current and failed programmes and to predict 1431 novel targets, of which a highly significant proportion has been independently proposed in the literature.

Conclusions: Our in silico approach shows that data linking genes and diseases is sufficient to predict novel therapeutic targets effectively and confirms that this type of evidence is essential for formulating or strengthening hypotheses in the target discovery process. Ultimately, more rapid and automated target prioritisation holds the potential to reduce both the costs and the development times associated with bringing new medicines to patients.
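
The nested cross-validation mentioned in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: a one-feature k-nearest-neighbour classifier stands in for the four classifiers named above, the data are synthetic, and all function names (`k_folds`, `knn_predict`, `cross_validate`, `nested_cv`) are hypothetical.

```python
from statistics import mean


def k_folds(n_items, k):
    """Split indices 0..n_items-1 into k interleaved folds."""
    return [list(range(i, n_items, k)) for i in range(k)]


def knn_predict(train, query, n_neighbours):
    """Majority vote among the n_neighbours training rows closest to query."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query))[:n_neighbours]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, n_neighbours):
    """Fraction of test rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, n_neighbours) == y for x, y in test)
    return hits / len(test)


def cross_validate(data, n_neighbours, k):
    """Plain k-fold cross-validation for one hyperparameter value."""
    scores = []
    for fold in k_folds(len(data), k):
        held_out = set(fold)
        train = [row for i, row in enumerate(data) if i not in held_out]
        test = [data[i] for i in fold]
        scores.append(accuracy(train, test, n_neighbours))
    return mean(scores)


def nested_cv(data, grid, outer_k=5, inner_k=3):
    """Estimate generalisation accuracy with tuning nested inside each outer fold."""
    outer_scores = []
    for fold in k_folds(len(data), outer_k):
        held_out = set(fold)
        outer_train = [row for i, row in enumerate(data) if i not in held_out]
        outer_test = [data[i] for i in fold]
        # Inner loop: tune the hyperparameter on the outer-training data only.
        best = max(grid, key=lambda n: cross_validate(outer_train, n, inner_k))
        # Outer loop: score the tuned model on rows it never saw during tuning.
        outer_scores.append(accuracy(outer_train, outer_test, best))
    return mean(outer_scores)


# Synthetic toy data (an assumption for illustration only):
# (association score, label), where 1 = known target and 0 = other gene.
toy = [(i / 20, 1 if i >= 10 else 0) for i in range(20)]
estimate = nested_cv(toy, grid=[1, 3, 5])
print(f"nested CV accuracy estimate: {estimate:.2f}")
```

The two loops exist so that the hyperparameter is chosen using only the outer-training rows; the outer score is then not inflated by the model selection step.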
