Algorithmic Probability-guided Supervised Machine Learning on Non-differentiable Spaces

We show how algorithmic complexity theory can be introduced in machine learning to help bring together apparently disparate areas of current research. We show that this new approach requires less training data and is more generalizable, exhibiting greater resilience to random attacks. We investigate the shape of the discrete algorithmic space when performing regression or classification using a loss function parametrized by algorithmic complexity, demonstrating that differentiability is not necessary to achieve results similar to those obtained using differentiable programming approaches such as deep learning. In doing so we use examples small enough to allow the two approaches to be compared, given the computational power required to estimate algorithmic complexity. We find and report that (i) machine learning can successfully be performed on a non-smooth surface using algorithmic complexity; (ii) parameter solutions can be found using an algorithmic-probability classifier, establishing a bridge between a fundamentally discrete theory of computability and a fundamentally continuous theory of mathematical optimization; (iii) an algorithmically directed search technique over non-smooth manifolds can be formulated and conducted; and (iv) exploitation techniques and numerical methods for algorithmic search can be used to navigate these discrete, non-differentiable spaces. We apply these results to (a) the identification of generative rules from data observations; (b) image classification problems that are more resilient to pixel attacks than neural networks; (c) the identification of equation parameters from a small data set in the presence of noise, in a continuous ODE system; and (d) the classification of Boolean NK networks by (1) network topology, (2) underlying Boolean function, and (3) number of incoming edges.
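To make the approach concrete, the following is a minimal Python sketch of a loss function parametrized by algorithmic complexity, minimized by directed search over a discrete, non-differentiable parameter space. This is not the paper's implementation: the zlib compressed length stands in for the CTM/BDM complexity estimates the paper relies on, and the linear toy model, the 8-bit parameter encoding, and the penalty weight lam are illustrative assumptions.

```python
import zlib
import numpy as np

def complexity(bits: np.ndarray) -> int:
    # Crude stand-in for an algorithmic-complexity estimate: length of
    # the zlib-compressed bit string. The paper uses CTM/BDM estimates;
    # any estimator with this interface could be substituted here.
    return len(zlib.compress(np.packbits(bits).tobytes()))

def encode_params(theta):
    # Toy binary encoding of candidate parameters (8 bits each, wrapping
    # negatives) so their description length can be estimated.
    return np.unpackbits(np.asarray(theta).astype(np.uint8))

def algorithmic_loss(theta, x, y, lam=0.05):
    # Data-fit term plus a complexity penalty on the parameters.
    # Non-differentiable in theta, so it is minimized by discrete search.
    pred = theta[0] + theta[1] * x  # hypothetical linear toy model
    return float(np.mean((pred - y) ** 2)) + lam * complexity(encode_params(theta))

def greedy_discrete_search(x, y, start=(0, 0), steps=200):
    # Directed search on the integer lattice: move to the best
    # neighbouring parameter vector until no neighbour improves.
    theta = np.array(start, dtype=int)
    best = algorithmic_loss(theta, x, y)
    for _ in range(steps):
        neighbours = [theta + d for d in ([1, 0], [-1, 0], [0, 1], [0, -1])]
        losses = [algorithmic_loss(n, x, y) for n in neighbours]
        i = int(np.argmin(losses))
        if losses[i] >= best:
            break  # local optimum on the lattice
        theta, best = neighbours[i], losses[i]
    return theta, best

rng = np.random.default_rng(0)
x = np.arange(20)
y = 3 * x + 7 + rng.normal(0, 0.5, size=20)  # noisy line; true params (7, 3)
print(greedy_discrete_search(x, y))
```

Because the complexity penalty changes in discrete jumps as the parameter encoding changes, the objective has no useful gradient; the lattice search above is one simple way to navigate such a space.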

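The algorithmic-probability classifier of point (ii) admits an equally small sketch, again with compressed length as an assumed stand-in for algorithmic complexity: a string is assigned to the class whose observed examples make it cheapest to describe, approximating the conditional complexity of the string given each class. The class prototypes and test strings below are hypothetical toy data.

```python
import zlib

def c(b: bytes) -> int:
    # Compressed length as a crude proxy for algorithmic complexity.
    return len(zlib.compress(b))

def classify(x: bytes, prototypes: dict) -> str:
    # Assign x to the class whose data makes it cheapest to describe:
    # argmin over classes of C(prototype + x) - C(prototype), an
    # approximation of the conditional complexity K(x | class).
    return min(prototypes, key=lambda k: c(prototypes[k] + x) - c(prototypes[k]))

# Hypothetical toy classes: two simple generative rules for bit strings.
prototypes = {
    "periodic": b"01" * 64,        # rule A: alternate bits
    "blocky":   b"00001111" * 16,  # rule B: repeat 4-bit blocks
}

print(classify(b"01" * 8, prototypes))        # expected: periodic
print(classify(b"00001111" * 2, prototypes))  # expected: blocky
```

Swapping the zlib proxy for a CTM/BDM estimator recovers the flavour of the classifier studied in the paper; the zlib version is only reliable on inputs long enough for the compressor to find structure.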
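Finally, the algorithmic-probability estimates themselves can be illustrated in miniature, following the coding-theorem route: enumerate a family of small programs, tally how often each output is produced, and estimate K(s) as -log2 of the output frequency of s. In this sketch the 256 elementary cellular automaton rules stand in for the Turing-machine enumeration used by the Coding Theorem Method; the rule family, tape width, and step count are illustrative assumptions.

```python
from collections import Counter
from math import log2

def eca_step(cells, rule):
    # One step of an elementary cellular automaton with wrap-around:
    # the rule's bit at index (left*4 + center*2 + right) gives the new cell.
    n = len(cells)
    return tuple(
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    )

def output_distribution(width=8, steps=8):
    # Run every rule from a single-1 initial condition and tally the rows
    # produced, approximating an output frequency distribution m(s).
    counts = Counter()
    for rule in range(256):
        cells = tuple(1 if i == width // 2 else 0 for i in range(width))
        for _ in range(steps):
            cells = eca_step(cells, rule)
            counts[cells] += 1
    return counts

counts = output_distribution()
total = sum(counts.values())

def k_estimate(s):
    # Coding-theorem-style estimate: K(s) is approximately -log2 m(s).
    return -log2(counts[s] / total)

simple = tuple([0] * 8)             # all-zero row: produced by many rules
rare = min(counts, key=counts.get)  # least frequently produced row
print(k_estimate(simple))           # low complexity estimate
print(k_estimate(rare))             # higher complexity estimate
```

Frequently produced outputs receive low complexity estimates and rarely produced ones high estimates, which is the resource-bounded behaviour the universal distribution predicts.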