Effective Bayesian inference for sparse factor analysis models

We study how to perform effective Bayesian inference in high-dimensional sparse factor analysis models with a zero-norm, sparsity-inducing prior on the model parameters. Such priors represent a methodological ideal, but Bayesian inference in these models is usually regarded as impractical. We test this view. After empirically characterising the properties of existing algorithmic approaches, we use techniques from statistical mechanics to derive a theory of optimal learning in the restricted setting of sparse PCA with a single factor. Finally, we describe a novel 'Dense Message Passing' (DMP) algorithm which achieves near-optimal performance on synthetic data generated from this model. DMP exploits properties of high-dimensional problems to operate successfully on a densely connected graphical model. Similar algorithms have been developed in the statistical physics community and previously applied to inference problems in coding and sparse classification. We demonstrate that DMP outperforms both a newly proposed variational hybrid algorithm and two other recently published algorithms (SPCA and emPCA) on synthetic data, while explaining at least as much variance, for a given level of sparsity, in two gene expression datasets used in previous studies of sparse PCA. A significant potential advantage of DMP is that it provides an estimate of the marginal likelihood which can be used for hyperparameter optimisation. We show that, for the single-factor case, this estimate exhibits good qualitative agreement both with theoretical predictions and with the hyperparameter posterior inferred by a collapsed Gibbs sampler. Preliminary work on an extension to inference of multiple factors indicates its potential for selecting an optimal model from among candidates that differ both in their numbers of factors and in their levels of sparsity.
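To make the data-generating setting concrete, the sketch below draws synthetic data from a single-factor sparse PCA model in which the loading vector carries a zero-norm (spike-and-slab) prior, i.e. each loading is exactly zero with high probability. This is a minimal illustrative sketch, not the DMP algorithm itself; the function name and the particular hyperparameter values (sparsity level, signal and noise variances) are assumptions chosen for illustration.

```python
import numpy as np


def generate_sparse_pca_data(n_samples=200, dim=1000, sparsity=0.05,
                             signal_var=10.0, noise_var=1.0, seed=0):
    """Draw synthetic data from a single-factor sparse PCA model.

    Each component of the loading vector w is zero with probability
    1 - sparsity (the 'spike'), otherwise Gaussian (the 'slab').
    Observations are x_n = y_n * w + Gaussian noise, with factor
    scores y_n drawn from a standard normal.
    """
    rng = np.random.default_rng(seed)
    support = rng.random(dim) < sparsity                      # non-zero loadings
    w = np.where(support, rng.normal(0.0, np.sqrt(signal_var), dim), 0.0)
    y = rng.normal(0.0, 1.0, n_samples)                       # factor scores
    noise = rng.normal(0.0, np.sqrt(noise_var), (n_samples, dim))
    X = np.outer(y, w) + noise
    return X, w, support


if __name__ == "__main__":
    X, w, support = generate_sparse_pca_data()
    print("data shape:", X.shape, "non-zero loadings:", int(support.sum()))
```

Data of this form, with the number of observed dimensions large relative to the number of samples, is the regime in which the theoretical analysis and the synthetic benchmarks described above are carried out.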
