Theory-Guided Data Science: A New Paradigm for Scientific Discovery from Data

Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge to improve the effectiveness of data science models in enabling scientific discovery. The overarching vision of TGDS is to introduce scientific consistency as an essential component for learning generalizable models. Further, by producing scientifically interpretable models, TGDS aims to advance our scientific understanding by discovering novel domain insights. Indeed, the paradigm of TGDS has started to gain prominence in a number of scientific disciplines, such as turbulence modeling, material discovery, quantum chemistry, biomedical science, biomarker discovery, climate science, and hydrology. In this paper, we formally conceptualize the paradigm of TGDS and present a taxonomy of research themes in TGDS. We describe several approaches for integrating domain knowledge within each research theme, using illustrative examples from different disciplines. We also highlight some promising avenues of novel research for realizing the full potential of theory-guided data science.
