Simulation-assisted machine learning

Abstract

Motivation: In a predictive modeling setting, if sufficient details of the system behavior are known, one can build and use a simulation for making predictions. When sufficient system details are not known, one typically turns to machine learning, which builds a black-box model of the system using a large dataset of input sample features and outputs. We consider a setting that lies between these two extremes: some details of the system mechanics are known, but not enough to create simulations that can make high-quality predictions. In this context we propose using approximate simulations to build a kernel for use in kernelized machine learning methods, such as support vector machines. The results of multiple simulations (under various uncertainty scenarios) are used to compute similarity measures between every pair of samples: sample pairs are given a high similarity score if they behave similarly under a wide range of simulation parameters. These similarity values, rather than the original high-dimensional feature data, are used to build the kernel.

Results: We demonstrate and explore the simulation-based kernel (SimKern) concept using four synthetic complex systems: three biologically inspired models and one network flow optimization model. We show that, when the number of training samples is small compared to the number of features, the SimKern approach outperforms no-prior-knowledge methods. This approach should be applicable in all disciplines where predictive models are sought and informative yet approximate simulations are available.

Availability and implementation: The Python SimKern software, the demonstration models (in MATLAB, R), and the datasets are available at https://github.com/davidcraft/SimKern.

Supplementary information: Supplementary data are available at Bioinformatics online.
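The kernel construction described in the Motivation paragraph can be illustrated with a minimal sketch. This is not the SimKern software from the repository above: the function names `toy_simulation` and `simkern_matrix`, the sigmoid toy model, the Gaussian per-scenario similarity, and all parameter values are assumptions chosen only for illustration. The sketch draws random simulation parameters (one draw per uncertainty scenario), runs every sample through the simulation, and averages a pairwise similarity of the simulated outputs over all scenarios.

```python
import math
import random


def toy_simulation(features, params):
    """Hypothetical approximate simulation (an assumption, not a SimKern model):
    a weighted sum of the sample's features squashed into (0, 1)."""
    z = sum(w * x for w, x in zip(params, features))
    return 1.0 / (1.0 + math.exp(-z))


def simkern_matrix(samples, n_sims=50, width=0.25, seed=0):
    """Average, over n_sims uncertainty scenarios, a Gaussian similarity
    between the simulated outputs of every pair of samples."""
    rng = random.Random(seed)
    n = len(samples)
    n_features = len(samples[0])
    totals = [[0.0] * n for _ in range(n)]
    for _ in range(n_sims):
        # One uncertainty scenario: draw the poorly known simulation parameters.
        params = [rng.gauss(1.0, 0.5) for _ in range(n_features)]
        outputs = [toy_simulation(s, params) for s in samples]
        for i in range(n):
            for j in range(n):
                diff = outputs[i] - outputs[j]
                # Close to 1 when the two samples behave alike in this scenario.
                totals[i][j] += math.exp(-diff * diff / (2.0 * width * width))
    # Pairs that behave alike across many scenarios end up with a high score.
    return [[value / n_sims for value in row] for row in totals]


if __name__ == "__main__":
    # Three toy samples: the first two are nearly identical, the third differs.
    samples = [[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0]]
    for row in simkern_matrix(samples):
        print([round(value, 3) for value in row])
```

The resulting matrix is symmetric with ones on the diagonal, and it takes the place of the original feature data. It could be passed to any kernel method that accepts a precomputed Gram matrix, for example scikit-learn's `SVC(kernel="precomputed")`.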
