Connectionist multivariate density-estimation and its application to speech synthesis

Autoregressive models factorize a multivariate joint probability distribution into a product of one-dimensional conditional distributions. The variables are assigned an ordering, and the conditional distribution of each variable is modelled using all variables preceding it in that ordering as predictors. Under autoregressive models, computing normalized probabilities and drawing samples both have polynomial computational complexity. Moreover, neural-network-based autoregressive models of binary data achieve statistical performance comparable to that of intractable models, such as restricted Boltzmann machines, on several datasets. The use of neural-network-based autoregressive density estimators to model real-valued data, while proposed before, had never been properly investigated and reported. In this thesis we extend the formulation of neural autoregressive distribution estimators (NADE) to real-valued data, yielding a model we call the real-valued neural autoregressive density estimator (RNADE). We report its statistical performance on several datasets, including visual and auditory data, and compare it with that of other models. RNADE obtained higher test likelihoods than other tractable models, while retaining all the attractive computational properties of autoregressive models.

However, autoregressive models are limited by the ordering of the variables inherent in their formulation: marginalization and imputation tasks can be solved analytically only if the missing variables lie at the end of the ordering. We present a new training technique that yields a single set of parameters usable with any ordering of the variables. By choosing a convenient ordering of the dimensions at test time, any marginalization or imputation task can be solved analytically. The same training procedure also makes it practical to train NADEs and RNADEs with several hidden layers. The resulting deep and tractable models obtain higher test likelihoods than the equivalent one-hidden-layer models on all the datasets tested. Furthermore, ensembles of NADEs or RNADEs can be created inexpensively by combining models that share their parameters but differ in the ordering of the variables; these ensembles of autoregressive models achieve state-of-the-art statistical performance on several datasets.

Finally, we demonstrate the application of RNADE to speech synthesis, and
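For reference, the autoregressive factorization described above, together with the mixture-of-Gaussians form of the RNADE conditionals, can be sketched as follows; the symbols here are illustrative, and the thesis's exact parameterization may differ in detail:

```latex
p(\mathbf{x}) \;=\; \prod_{d=1}^{D} p\!\left(x_d \mid \mathbf{x}_{<d}\right),
\qquad
p\!\left(x_d \mid \mathbf{x}_{<d}\right) \;=\;
\sum_{k=1}^{K} \pi_{d,k}\,
\mathcal{N}\!\left(x_d;\ \mu_{d,k},\ \sigma_{d,k}^{2}\right),
```

where the mixture weights \pi_{d,k}, means \mu_{d,k} and standard deviations \sigma_{d,k} are computed by a neural network from the preceding variables \mathbf{x}_{<d}.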

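As a minimal, self-contained sketch of how such a model evaluates a density, the NumPy function below walks the chain rule dimension by dimension while maintaining a single running pre-activation, so all D hidden states cost O(DH) in total; this weight-sharing trick is what keeps NADE-family models tractable. All names, shapes, and the ReLU nonlinearity are illustrative assumptions rather than the thesis's exact implementation.

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z)))."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def rnade_log_density(x, W, c, V_pi, b_pi, V_mu, b_mu, V_s, b_s):
    """Log-density of one datapoint under a hypothetical one-hidden-layer
    RNADE-style model with K-component Gaussian-mixture conditionals.

    Shapes (all parameter names are illustrative):
      x   : (D,)       datapoint
      W   : (D, H)     input-to-hidden weights; row d enters once x_d is seen
      c   : (H,)       hidden biases
      V_* : (D, K, H)  per-dimension hidden-to-output weights
      b_* : (D, K)     per-dimension output biases
    """
    D, H = W.shape
    a = c.copy()                               # running pre-activation: c + sum_{j<d} x_j W_j
    log_p = 0.0
    for d in range(D):
        h = np.maximum(a, 0.0)                 # hidden state for p(x_d | x_<d); ReLU for simplicity
        z = V_pi[d] @ h + b_pi[d]
        log_pi = z - logsumexp(z)              # log mixture weights (softmax in log space)
        mu = V_mu[d] @ h + b_mu[d]             # component means
        sig = np.exp(V_s[d] @ h + b_s[d])      # component std devs, kept positive via exp
        log_comp = (-0.5 * ((x[d] - mu) / sig) ** 2
                    - np.log(sig) - 0.5 * np.log(2.0 * np.pi))
        log_p += logsumexp(log_pi + log_comp)  # log p(x_d | x_<d)
        a += x[d] * W[d]                       # fold the observed x_d into the shared state
    return log_p

# Usage with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
D, H, K = 5, 16, 3
params = dict(
    W=0.1 * rng.standard_normal((D, H)), c=np.zeros(H),
    V_pi=0.1 * rng.standard_normal((D, K, H)), b_pi=np.zeros((D, K)),
    V_mu=0.1 * rng.standard_normal((D, K, H)), b_mu=np.zeros((D, K)),
    V_s=0.1 * rng.standard_normal((D, K, H)), b_s=np.zeros((D, K)),
)
x = rng.standard_normal(D)
print(rnade_log_density(x, **params))
```

Sampling follows the same loop, drawing each x_d from its conditional before folding it into the running state; the order-agnostic training described in the abstract would additionally resample the ordering of the dimensions for every parameter update.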