Graph-induced restricted Boltzmann machines for document modeling

Discovering knowledge from unstructured text is a central theme in data mining and machine learning. We focus on fast discovery of thematic structures from a corpus. Our approach is based on a versatile probabilistic formulation, the restricted Boltzmann machine (RBM), whose underlying graphical model is an undirected bipartite graph. Inference is efficient: a document representation can be computed with a single matrix projection, making RBMs suitable for the massive text corpora available today. Standard RBMs, however, operate under the bag-of-words assumption and ignore the inherent relational structures among words, which results in less coherent thematic groupings of words. We introduce graph-based regularization schemes that exploit linguistic structures, which in turn can be constructed from either corpus statistics or domain knowledge. We demonstrate that the proposed technique improves group coherence, facilitates visualization, provides a means of estimating intrinsic dimensionality, reduces overfitting, and can lead to better classification accuracy.
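
To make the two ingredients of the abstract concrete, below is a minimal NumPy sketch: the "single matrix projection" that yields a document's thematic representation in a binary RBM, and one contrastive-divergence update augmented with a graph-smoothness penalty tr(WᵀLW) built from a word-similarity graph A. The variable names, the CD-1 learner, and the Laplacian penalty are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions: N vocabulary words, K hidden (thematic) units.
rng = np.random.default_rng(0)
N, K = 1000, 50
W = 0.01 * rng.standard_normal((N, K))   # word-to-hidden weights
b = np.zeros(N)                          # visible (word) biases
c = np.zeros(K)                          # hidden (topic) biases

def represent(doc_counts):
    """Single matrix projection: posterior of the hidden units given a
    bag-of-words vector, i.e. the document's thematic representation."""
    return sigmoid(doc_counts @ W + c)

def laplacian(A):
    """Unnormalized graph Laplacian L = D - A of a word-similarity graph A."""
    return np.diag(A.sum(axis=1)) - A

def cd1_step_with_graph_penalty(V, A, lr=0.05, lam=0.1):
    """One CD-1 update plus the gradient of (lam/2) * tr(W' L W), which
    pulls the weight rows of graph-adjacent words toward each other.
    Illustrative only; the paper's regularizer may differ in detail."""
    global W, b, c
    L = laplacian(A)
    h0 = sigmoid(V @ W + c)          # positive phase
    v1 = sigmoid(h0 @ W.T + b)       # one Gibbs reconstruction step
    h1 = sigmoid(v1 @ W + c)         # negative phase
    grad_W = (V.T @ h0 - v1.T @ h1) / V.shape[0] - lam * (L @ W)
    W += lr * grad_W
    b += lr * (V - v1).mean(axis=0)
    c += lr * (h0 - h1).mean(axis=0)

# Toy usage: random binary "documents" and a sparse symmetric word graph.
V = (rng.random((8, N)) < 0.02).astype(float)
A = (rng.random((N, N)) < 0.001).astype(float)
A = np.maximum(A, A.T)
cd1_step_with_graph_penalty(V, A)
print(represent(V[0]).shape)  # (50,): the thematic representation
```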
