$t$-Exponential Memory Networks for Question-Answering Machines

Recent advances in deep learning have brought to the fore models that can perform multiple computational steps in the service of completing a task; these are capable of describing long-term dependencies in sequential data. Novel recurrent attention models over possibly large external memory modules constitute the core mechanisms that enable these capabilities. Our work addresses learning subtler and more complex underlying temporal dynamics in language modeling tasks that deal with sparse sequential data. To this end, we improve upon these recent advances by adopting concepts from the field of Bayesian statistics, namely, variational inference. Our proposed approach treats the network parameters as latent variables with a prior distribution imposed over them. Our statistical assumptions go beyond the standard practice of postulating Gaussian priors. Indeed, to handle outliers, which are prevalent in long observed sequences of multivariate data, we impose multivariate $t$-exponential distributions. On this basis, we proceed to infer the corresponding posteriors; these can be used for inference and prediction at test time, in a way that accounts for the uncertainty in the available sparse training data. Specifically, to allow our approach to best exploit the merits of the $t$-exponential family, our method considers a new $t$-divergence measure, which generalizes the concept of the Kullback–Leibler divergence. We perform an extensive experimental evaluation of our approach on challenging language modeling benchmarks, and illustrate its superiority over existing state-of-the-art techniques.
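To make the notions of $t$-exponential distribution and $t$-divergence more concrete, we give a brief sketch based on standard definitions from the Tsallis/$t$-exponential family literature; the exact constants and normalization entering our derivations may differ, so the formulas below should be read as illustrative background rather than our precise construction. For a deformation parameter $t \neq 1$, the deformed logarithm and exponential are

\[ \log_t(x) = \frac{x^{1-t} - 1}{1 - t}, \qquad \exp_t(x) = \big[\, 1 + (1-t)\, x \,\big]_{+}^{1/(1-t)}, \]

both of which recover the ordinary $\log$ and $\exp$ in the limit $t \to 1$. Given a density $p$, its escort distribution is $\tilde{p}(x) = p(x)^{t} \big/ \int p(u)^{t}\, du$, and the $t$-divergence between densities $p$ and $q$ is commonly defined as

\[ D_t(p \,\|\, q) = \int \tilde{p}(x)\, \big( \log_t p(x) - \log_t q(x) \big)\, dx, \]

which reduces to the Kullback–Leibler divergence as $t \to 1$ and plays, in our variational treatment, the role that the KL term plays in standard variational inference objectives.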
