TraDE: Transformers for Density Estimation

We present TraDE, an attention-based architecture for auto-regressive density estimation. In addition to a maximum likelihood loss, we employ a Maximum Mean Discrepancy (MMD) two-sample loss to ensure that samples from the estimated density resemble the training data. The use of attention means that the model need not retain conditional sufficient statistics beyond what is needed for each covariate. TraDE performs significantly better than existing approaches, such as differentiable flow-based estimators, on standard tabular and image-based benchmarks in terms of log-likelihood on held-out data. TraDE also works well on a wide range of downstream tasks, including ascertaining the quality of generated samples with classification methods, detecting out-of-distribution samples, and handling outliers in the training data.
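For reference, the squared MMD between distributions p and q under a kernel k is MMD²(p, q) = E[k(x, x')] + E[k(y, y')] − 2 E[k(x, y)], with x, x' ~ p and y, y' ~ q, and it is estimated from finite samples of each. The following is a minimal PyTorch sketch of a combined objective of this kind; the model's log_prob/sample interface, the Gaussian kernel, its bandwidth, and the weight lam are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def rbf_mmd2(x, y, bandwidth=1.0):
    # Biased estimate of MMD^2 between sample batches x and y
    # using a Gaussian (RBF) kernel; the bandwidth is an assumed default.
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def trade_loss(model, batch, lam=1.0):
    # Negative log-likelihood from the autoregressive factorization.
    nll = -model.log_prob(batch).mean()
    # Two-sample MMD term between model samples and the training batch.
    # Hypothetical interface: model.sample(n) draws n samples.
    samples = model.sample(batch.shape[0])
    return nll + lam * rbf_mmd2(samples, batch)
```

In this sketch, lam trades off fitting the autoregressive likelihood against matching the sample distribution to the data; both terms are computed per mini-batch.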
