MML, hybrid Bayesian network graphical models, statistical consistency, invariance

Publisher Summary: The problem of statistical (or inductive) inference pervades a great many human activities, as well as many human and non-human actions requiring "intelligence." The Minimum Message Length (MML) approach to machine learning (within artificial intelligence) and statistical inference trades off simplicity of hypothesis against goodness of fit to the data. There are several different and intuitively appealing ways of thinking about MML. There are also many measures of predictive accuracy; the most common form of prediction is a bare prediction, unaccompanied by a probability or anything else to quantify it. MML can also be framed in terms of algorithmic information theory: the hypothesis-plus-data message is the shortest input to a (Universal) Turing Machine [(U)TM], or computer program, that yields the original data string. This chapter sheds light on information theory, Turing machines, and algorithmic information theory, and relates all of these to MML. It then moves on to Ockham's razor and the distinction between inference (or induction, or explanation) and prediction.
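The simplicity-versus-fit trade-off described above can be made concrete with a crude two-part message-length sketch (not the chapter's own formulation, and deliberately simplified): the first part of the message states the model's parameters to some fixed precision, and the second part encodes the data given the model, here as the Gaussian code length of the residuals. All names, the `bits_per_param` precision, and the constant-versus-line comparison are illustrative assumptions.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line y = a + b*x (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def message_length(residuals, n_params, bits_per_param=8.0):
    """Crude two-part length: hypothesis part (each parameter stated to a
    fixed number of bits) plus data part (Gaussian code length of the
    residuals). Illustrative only -- a proper MML treatment would also
    choose the parameter precision optimally."""
    n = len(residuals)
    var = max(sum(r * r for r in residuals) / n, 1e-12)  # avoid log(0)
    first_part = n_params * bits_per_param
    second_part = 0.5 * n * math.log2(2 * math.pi * math.e * var)
    return first_part + second_part

def pick_model(xs, ys):
    """Prefer whichever of two hypothetical models (constant vs. line)
    gives the shorter total two-part message."""
    my = sum(ys) / len(ys)
    const_res = [y - my for y in ys]
    a, b = fit_line(xs, ys)
    line_res = [y - (a + b * x) for x, y in zip(xs, ys)]
    len_const = message_length(const_res, n_params=1)
    len_line = message_length(line_res, n_params=2)
    return "line" if len_line < len_const else "constant"
```

On clearly linear data the saving in the data part outweighs the extra parameter's cost, so the line is chosen; on constant data the extra parameter buys nothing, so the simpler constant model wins. This is the sense in which minimizing total message length penalizes needless complexity.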
