Predicting clinically promising therapeutic hypotheses using tensor factorization

Background
Determining which target to pursue is a challenging and error-prone first step in developing a therapeutic treatment for a disease, and missteps are potentially very costly given the long time frames and high expenses of drug development. With current informatics technology and machine learning algorithms, it is now possible to computationally discover therapeutic hypotheses by predicting clinically promising drug targets based on the evidence associating drug targets with disease indications. We collected this evidence from Open Targets and additional databases, covering 17 sources of evidence for target-indication association, and represented the data as a tensor of size 21,437 × 2211 × 17.

Results
As a proof of concept, we identified examples of successes and failures of target-indication pairs in clinical trials across 875 targets and 574 disease indications to build a gold-standard data set of 6140 known clinical outcomes. We designed and executed three benchmarking strategies to examine the performance of multiple machine learning models: Logistic Regression, LASSO, Random Forest, Tensor Factorization, and Gradient Boosting Machine. With 10-fold cross-validation, tensor factorization achieved AUROC = 0.82 ± 0.02 and AUPRC = 0.71 ± 0.03. Across multiple validation schemes, this was comparable to or better than the other methods.

Conclusion
In this work, we benchmarked a machine learning technique called tensor factorization on the problem of predicting clinical outcomes of therapeutic hypotheses. The results show that this method can achieve prediction performance equal to or better than that of a variety of baseline models. We demonstrate one application of the method: predicting outcomes of trials on novel indications of approved drug targets. This work can be extended to targets and indications that have never been clinically tested, and to proposing novel target-indication hypotheses. Our proposed biologically motivated cross-validation schemes provide insight into the robustness of the prediction performance. This has significant implications for future methods that try to address this seminal problem in drug discovery.
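To make the idea of factorizing a target × indication × evidence-source tensor concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain CP (CANDECOMP/PARAFAC) decomposition fitted by alternating least squares on a small synthetic tensor; the function names (`khatri_rao`, `unfold`, `cp_als`), the toy dimensions, and the rank are all illustrative assumptions, and the abstract does not specify which factorization variant or software was used.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R) -> (I*J x R)."""
    rank = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, rank)


def unfold(x, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)


def cp_als(x, rank, n_iter=200, seed=0):
    """Fit a rank-`rank` CP model to a 3-way tensor by alternating least squares.

    Illustrative only: no missing-value handling, no regularization,
    no convergence check.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in x.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # Hold the other two factor matrices fixed and solve for this one.
            first, second = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(first, second)
            gram = (first.T @ first) * (second.T @ second)
            factors[mode] = unfold(x, mode) @ kr @ np.linalg.pinv(gram)
    return factors


# Toy stand-in for the evidence tensor (hypothetical sizes, synthetic values):
# 6 targets x 5 indications x 4 evidence sources, built from 2 latent components.
rng = np.random.default_rng(42)
true_factors = [rng.random((dim, 2)) for dim in (6, 5, 4)]
x = np.einsum("ir,jr,kr->ijk", *true_factors)

targets, indications, sources = cp_als(x, rank=2)
approx = np.einsum("ir,jr,kr->ijk", targets, indications, sources)
rel_err = np.linalg.norm(x - approx) / np.linalg.norm(x)
print(f"relative reconstruction error: {rel_err:.2e}")
```

In this sketch, each target and each indication ends up with a short latent vector (a row of `targets` or `indications`), and the reconstructed entry for a target-indication-source triple is the sum over components of the product of the three corresponding factor values. A real evidence tensor of the size described above would be sparse and mostly unobserved, so a practical method would need to handle missing entries rather than fit a dense tensor as done here.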
