The Stochastic Complexity of Spin Models: Are Pairwise Models Really Simple?

Models can be simple for different reasons: because they yield a simple and computationally efficient interpretation of a generic dataset (e.g., in terms of pairwise dependencies), as in statistical learning, or because they capture the laws of a specific phenomenon, as in physics, leading to non-trivial falsifiable predictions. In information theory, the simplicity of a model is quantified by its stochastic complexity, which measures the number of bits needed to encode its parameters. In order to understand what simple models look like, we study the stochastic complexity of spin models with interactions of arbitrary order. We show that bijections within the space of possible interactions preserve the stochastic complexity, which allows us to partition the space of all models into equivalence classes. We thus find that the simplicity of a model is determined not by the order of its interactions but by their mutual arrangement. Models in which statistical dependencies are localized on non-overlapping groups of few variables are simple, affording predictions on independencies that are easy to falsify. By contrast, fully connected pairwise models, which are often used in statistical learning, appear to be highly complex because of their extended set of interactions, and they are hard to falsify.
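To make the quantities in the abstract concrete, here is a minimal sketch in standard notation (chosen for illustration; the paper's own notation may differ). A spin model is specified by a set \mathcal{M} of interactions, each acting on a subset of the n binary spins, and its stochastic complexity follows Rissanen's asymptotic expansion, in which the penalty terms beyond the maximized likelihood quantify how complex the model class is:

% A model \mathcal{M} is a set of interactions; each interaction \mu is a subset of
% spins and enters through the operator \phi^{\mu}(\mathbf{s}) = \prod_{i \in \mu} s_i :
P(\mathbf{s} \mid \mathbf{g}, \mathcal{M})
  = \frac{1}{Z_{\mathcal{M}}(\mathbf{g})}
    \exp\!\Big( \sum_{\mu \in \mathcal{M}} g^{\mu}\, \phi^{\mu}(\mathbf{s}) \Big),
  \qquad s_i \in \{-1,+1\}.

% Stochastic complexity of a dataset \hat{s} of N samples under \mathcal{M}
% (Rissanen's asymptotic expansion; I(\mathbf{g}) is the Fisher information matrix,
% and the two penalty terms form the parametric complexity):
\mathrm{SC}(\hat{s} \mid \mathcal{M})
  = -\log P(\hat{s} \mid \hat{\mathbf{g}}, \mathcal{M})
  + \frac{|\mathcal{M}|}{2} \log \frac{N}{2\pi}
  + \log \int \! d\mathbf{g}\, \sqrt{\det I(\mathbf{g})} \; + \; o(1).

In this reading, the bijections mentioned in the abstract act on the interaction sets \mathcal{M}, and the claim is that models related by such a map have identical penalty terms, and hence identical stochastic complexity.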
