Low-Rank Constraints for Fast Inference in Structured Models

Structured distributions, i.e., distributions over combinatorial spaces, are commonly used to learn latent probabilistic representations from observed data. However, scaling these models is bottlenecked by their high computational and memory complexity with respect to the size of the latent representation. Common models such as Hidden Markov Models (HMMs) and Probabilistic Context-Free Grammars (PCFGs) require time and space that scale quadratically and cubically, respectively, in the number of hidden states. This work demonstrates a simple approach to reducing the computational and memory complexity of a large class of structured models. We show that by viewing the central inference step as a matrix-vector product and imposing a low-rank constraint, we can trade off model expressivity and speed via the rank. Experiments with neurally parameterized structured models for language modeling, polyphonic music modeling, unsupervised grammar induction, and video modeling show that our approach matches the accuracy of standard models at large state spaces while providing practical speedups.
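
The central computational observation can be illustrated with a short sketch. The following NumPy example is our own illustration, not code from the paper; the sizes n and r, the factor matrices U and V, and the random parameterization are hypothetical. It shows how a rank-r factorization of the HMM transition matrix turns the O(n^2) forward-step matrix-vector product into two O(n*r) products.

```python
import numpy as np

# Minimal sketch (illustrative only): one HMM forward step is a matrix-vector
# product alpha' = alpha @ A, which costs O(n^2) for n hidden states. With a
# rank-r factorization A = U @ V, the same product is (alpha @ U) @ V, O(n*r).

n, r = 1024, 16                       # number of hidden states, rank (r << n)
rng = np.random.default_rng(0)

# Row-stochastic factors; the product of row-stochastic matrices is itself
# row-stochastic, so A = U @ V is a valid transition matrix.
U = rng.random((n, r))
U /= U.sum(axis=1, keepdims=True)     # shape (n, r)
V = rng.random((r, n))
V /= V.sum(axis=1, keepdims=True)     # shape (r, n)

alpha = rng.random(n)
alpha /= alpha.sum()                  # current forward (belief) vector

# Dense update: materialize A and multiply, O(n^2) time and memory.
A = U @ V
alpha_dense = alpha @ A

# Low-rank update: never materialize A, O(n * r) time and memory.
alpha_lowrank = (alpha @ U) @ V

assert np.allclose(alpha_dense, alpha_lowrank)
```

In practice the factors would come from a learned neural parameterization rather than fixed random matrices, but the trade-off is the same: the rank r controls both the expressivity of the transition matrix and the per-step inference cost.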
