Nonnegative Restricted Boltzmann Machines for Parts-based Representations Discovery and Predictive Model Stabilization

The success of any machine learning system depends critically on effective representations of data. In many cases, it is desirable that a representation scheme uncovers the parts-based, additive nature of the data. Of current representation learning schemes, restricted Boltzmann machines (RBMs) have proved to be highly effective in unsupervised settings. However, when it comes to parts-based discovery, RBMs do not usually produce satisfactory results. We enhance this capacity of RBMs by introducing nonnegativity into the model weights, resulting in a variant called the nonnegative restricted Boltzmann machine (NRBM). The NRBM not only produces a controllable decomposition of data into interpretable parts but also offers a way to estimate the intrinsic nonlinear dimensionality of data and helps to stabilize linear predictive models. We demonstrate the capacity of our model on applications such as handwritten digit recognition, face recognition, document classification and patient readmission prognosis. The decomposition quality on images is comparable with or better than that produced by nonnegative matrix factorization (NMF), and the thematic features uncovered from text are qualitatively interpretable in a manner similar to those of latent Dirichlet allocation (LDA). The stability of feature selection on medical data is better than that of the RBM and competitive with that of NMF. The learned features, when used for classification, are more discriminative than those discovered by both NMF and LDA and comparable with those discovered by the RBM.
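As a rough illustration of what "introducing nonnegativity into the model weights" could look like in practice, the sketch below adds a soft penalty on negative weights to a one-step contrastive divergence (CD-1) update for a binary RBM. The abstract does not specify how the constraint is imposed, so the quadratic penalty, the CD-1 training rule, and every name here (`cd1_step`, `negative_part_penalty_grad`, `alpha`, `lr`) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_part_penalty_grad(W):
    """Gradient of 0.5 * sum(min(W, 0)**2) with respect to W.

    Assumed penalty form: it is zero for nonnegative weights and grows
    quadratically as a weight becomes more negative.
    """
    return np.minimum(W, 0.0)


def cd1_step(W, a, b, v0, lr=0.1, alpha=0.0, rng=None):
    """One CD-1 update for a binary RBM with a soft nonnegativity penalty.

    W: (n_visible, n_hidden) weights; a: visible biases; b: hidden biases.
    v0: (batch, n_visible) binary data. alpha = 0 gives a plain RBM update.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Positive phase: hidden activations driven by the data.
    h0_prob = sigmoid(v0 @ W + b)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)

    # Negative phase: one Gibbs step back to the visibles and up again.
    v1_prob = sigmoid(h0 @ W.T + a)
    h1_prob = sigmoid(v1_prob @ W + b)

    n = v0.shape[0]
    grad_W = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n

    # Ascend the CD objective minus alpha times the penalty, so that
    # negative weights are pushed back toward zero.
    W = W + lr * (grad_W - alpha * negative_part_penalty_grad(W))
    a = a + lr * (v0 - v1_prob).mean(axis=0)
    b = b + lr * (h0_prob - h1_prob).mean(axis=0)
    return W, a, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = (rng.random((64, 20)) < 0.3).astype(float)  # toy binary data
    W = 0.01 * rng.standard_normal((20, 5))
    a, b = np.zeros(20), np.zeros(5)
    for _ in range(200):
        W, a, b = cd1_step(W, a, b, data, lr=0.1, alpha=5.0, rng=rng)
    print("most negative weight:", W.min())
```

In this sketch `alpha` controls how strongly negative weights are discouraged: `alpha=0` recovers an ordinary CD-1 update, while larger values trade some likelihood for weights that are closer to nonnegative.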
