Bayesian additive adaptive basis tensor product models for modeling high dimensional surfaces: an application to high‐throughput toxicity testing

Many modern datasets are sampled with error from complex high-dimensional surfaces. Methods such as tensor product splines or Gaussian processes are effective and well suited for characterizing a surface in two or three dimensions, but they can struggle to represent higher-dimensional surfaces. This work is motivated by high-throughput toxicity testing, where each observed dose-response curve is a cross section of a surface defined by a chemical's structural properties; a model of this surface is developed to predict the dose-responses of untested chemicals. The manuscript proposes a novel approach that models the multidimensional surface as a sum of learned basis functions, each formed as the tensor product of lower-dimensional functions that are themselves representable by a basis expansion learned from the data. The model is described and a Gibbs sampling algorithm is proposed. The approach is investigated in a simulation study and with data from the US EPA's ToxCast high-throughput toxicity testing platform.
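The additive tensor product structure described above can be illustrated with a small forward-evaluation sketch. This is not the authors' implementation: it assumes a single scalar chemical descriptor `s` and a scalar dose `d`, uses Gaussian bump bases chosen purely for illustration, and draws the coefficient matrices `A` and `B` at random rather than learning them with a Gibbs sampler. All function and variable names are hypothetical.

```python
import numpy as np

def rbf_basis(x, centers, width=0.5):
    """Evaluate Gaussian bumps at x; returns shape (len(x), len(centers)).
    The Gaussian basis is an illustrative assumption, not the paper's choice."""
    x = np.asarray(x, dtype=float)[:, None]
    return np.exp(-0.5 * ((x - centers[None, :]) / width) ** 2)

def additive_tensor_surface(s, d, s_centers, d_centers, A, B):
    """Evaluate f(s_i, d_j) = sum_k a_k(s_i) * b_k(d_j) on a grid.

    Each a_k and b_k is a low-dimensional function written as a basis
    expansion; row k of A (resp. B) holds its coefficients.
    """
    a = rbf_basis(s, s_centers) @ A.T  # (n_s, K): a_k evaluated at each s_i
    b = rbf_basis(d, d_centers) @ B.T  # (n_d, K): b_k evaluated at each d_j
    return a @ b.T                     # (n_s, n_d): sum over the K products

rng = np.random.default_rng(0)
K = 3                                  # number of additive tensor product terms
s_centers = np.linspace(0.0, 1.0, 5)
d_centers = np.linspace(0.0, 1.0, 4)
A = rng.normal(size=(K, s_centers.size))  # placeholder coefficients (not learned)
B = rng.normal(size=(K, d_centers.size))
s = np.linspace(0.0, 1.0, 6)           # descriptor values, one per "chemical"
d = np.linspace(0.0, 1.0, 7)           # dose grid
surface = additive_tensor_surface(s, d, s_centers, d_centers, A, B)
```

Each row of `surface` plays the role of one chemical's dose-response curve, that is, a cross section of the surface at a fixed descriptor value. In a Bayesian fit along the lines the abstract describes, the coefficients would be given priors and sampled rather than fixed.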
