Statistical mechanics of sparse generalization and graphical model selection

One of the crucial tasks in many inference problems is the extraction of an underlying sparse graphical model from a given number of high-dimensional measurements. In machine learning, this is frequently achieved using, as a penalty term, the Lp norm of the model parameters, with p≤1 for efficient dilution. Here we propose a statistical mechanics analysis of the problem in the setting of perceptron memorization and generalization. Using a replica approach, we are able to evaluate the relative performance of naive dilution (obtained by learning without dilution, followed by applying a threshold to the model parameters), L1 dilution (which is frequently used in convex optimization) and L0 dilution (which is optimal but computationally hard to implement). Whereas both Lp-diluted approaches clearly outperform the naive approach, we find a small region where L0 works almost perfectly and strongly outperforms the simpler-to-implement L1 dilution.
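To make the two tractable dilution schemes concrete, the following is a minimal numerical sketch (not the paper's replica calculation): data are generated by a sparse teacher perceptron, and the couplings are recovered either by "naive" dilution (unpenalized learning followed by thresholding) or by L1-diluted learning via proximal gradient descent. All sizes, learning rates and the regularization strength are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 100, 400, 10            # input dimension, number of patterns, nonzero teacher couplings

# Sparse teacher perceptron: only K of the N couplings are nonzero.
teacher = np.zeros(N)
teacher[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
X = rng.standard_normal((P, N))
y = np.sign(X @ teacher)

def grad(w):
    """Gradient of the mean logistic loss for labels y in {-1, +1}."""
    margins = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(np.clip(margins, -30, 30))))) / P

# Naive dilution: unpenalized gradient descent, then keep only the K largest weights.
w = np.zeros(N)
for _ in range(2000):
    w -= 0.5 * grad(w)
cut = np.sort(np.abs(w))[-K]
w_naive = np.where(np.abs(w) >= cut, w, 0.0)

# L1 dilution: proximal gradient descent (ISTA); weights are driven to zero during learning.
lam, eta = 0.02, 0.5
w1 = np.zeros(N)
for _ in range(2000):
    z = w1 - eta * grad(w1)
    w1 = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)   # soft threshold

# Compare the recovered supports with the teacher's support.
support = set(np.flatnonzero(teacher))
for name, ww in [("naive", w_naive), ("L1", w1)]:
    found = set(np.flatnonzero(ww))
    print(f"{name:5s}  true positives: {len(found & support):2d}   "
          f"false positives: {len(found - support):3d}")
```

L0 dilution has no comparably simple implementation: it amounts to an exhaustive search over supports and is included in the sketch only by omission, which is precisely the computational hardness the abstract refers to.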
