Data-driven formulation of natural laws by recursive-LASSO-based symbolic regression

The discovery of new natural laws has long relied on the inspiration of individual geniuses. Recently, however, machine learning, which can analyze large datasets without human prejudice or bias, has raised expectations that novel natural laws can be found from data. Here we demonstrate that our proposed machine learning method, recursive-LASSO-based symbolic (RLS) regression, enables the data-driven formulation of natural laws from noisy data. RLS regression iteratively alternates feature generation and feature selection, eventually constructing a data-driven model with highly nonlinear features. The method is quite general and can therefore be applied to discover new laws in various scientific fields.
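
The alternation of feature generation and feature selection described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the feature operators, the LASSO solver, or any hyperparameters, so everything below is an assumption made for illustration. In particular, features are generated only as pairwise products, the LASSO is a small hand-written coordinate-descent solver, and the names `generate`, `select`, `rls_sketch`, `alpha`, `keep`, and `rounds` are hypothetical.

```python
import itertools
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used in LASSO coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def standardize(col):
    """Shift a column to zero mean and scale it to unit variance."""
    n = len(col)
    mean = sum(col) / n
    std = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    if std == 0:
        std = 1.0  # a constant column becomes all zeros
    return [(v - mean) / std for v in col]

def lasso(columns, y, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n = len(y)
    w = [0.0] * len(columns)
    resid = list(y)  # residual y - Xw, kept up to date after every step
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, resid)) / n
            norm = sum(c * c for c in col) / n
            new_w = soft_threshold(rho, alpha) / norm if norm > 0 else 0.0
            delta = new_w - w[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                w[j] = new_w
    return w

def generate(features):
    """Feature generation (assumed operator set: pairwise products only)."""
    out = dict(features)
    pairs = itertools.combinations_with_replacement(features.items(), 2)
    for (name_a, a), (name_b, b) in pairs:
        out[f"({name_a}*{name_b})"] = [x * z for x, z in zip(a, b)]
    return out

def select(features, y, alpha, keep):
    """Feature selection: keep at most `keep` features with nonzero LASSO weight."""
    names = list(features)
    cols = [standardize(features[name]) for name in names]
    mean_y = sum(y) / len(y)
    w = lasso(cols, [v - mean_y for v in y], alpha)
    ranked = sorted(zip(names, w), key=lambda pair: -abs(pair[1]))
    return {name: features[name] for name, wj in ranked[:keep] if wj != 0.0}

def rls_sketch(features, y, rounds=2, alpha=0.05, keep=4):
    """Repeat generation and selection; survivors seed the next round."""
    for _ in range(rounds):
        features = select(generate(features), y, alpha, keep)
    return features

# Synthetic noisy data from an assumed "law" y = a*b*c (illustration only).
random.seed(0)
n = 200
a = [random.uniform(1, 2) for _ in range(n)]
b = [random.uniform(1, 2) for _ in range(n)]
c = [random.uniform(1, 2) for _ in range(n)]
y = [ai * bi * ci + random.gauss(0, 0.01) for ai, bi, ci in zip(a, b, c)]

selected = rls_sketch({"a": a, "b": b, "c": c}, y)
print(sorted(selected))
```

Because each round builds new candidates from the survivors of the previous round, the candidate pool can reach higher-order terms such as a three-way product after two rounds without enumerating every cubic term up front. Which features actually survive depends on the assumed `alpha` and `keep` values, so the printed names should be treated as illustrative rather than as a recovered law.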
