Data augmentation and feature extraction using variational autoencoder for acoustic modeling

A data augmentation and feature extraction method using a variational autoencoder (VAE) for acoustic modeling is described. A VAE is a generative model based on variational Bayesian learning using a deep learning framework. A VAE can extract latent values its input variables to generate new information. VAEs are widely used to generate pictures and sentences. In this paper, a VAE is applied to speech corpus data augmentation and feature vector extraction from speech for acoustic modeling. First, the size of a speech corpus is doubled by encoding latent variables extracted from original utterances using a VAE framework. The latent variables extracted from speech waveforms have latent "meanings" of the waveforms. Therefore, latent variables can be used as acoustic features for automatic speech recognition (ASR). This paper experimentally shows the effectiveness of data augmentation using a VAE framework and that latent variable-based features can be utilized in ASR.

[1]  Tara N. Sainath,et al.  Auto-encoder bottleneck features using deep belief networks , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  DeLiang Wang,et al.  Deep neural network based spectral feature mapping for robust speech recognition , 2015, INTERSPEECH.

[3]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[4]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[5]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[6]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[7]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[8]  Navdeep Jaitly,et al.  Vocal Tract Length Perturbation (VTLP) improves speech recognition , 2013 .

[9]  Xiaodong Cui,et al.  Data Augmentation for Deep Neural Network Acoustic Modeling , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[10]  Kevin Duh,et al.  Automation of system building for state-of-the-art large vocabulary speech recognition using evolution strategy , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[11]  Ruslan Salakhutdinov,et al.  Importance Weighted Autoencoders , 2015, ICLR.

[12]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[13]  Mark J. F. Gales,et al.  Data augmentation for low resource languages , 2014, INTERSPEECH.

[14]  K. Maekawa CORPUS OF SPONTANEOUS JAPANESE : ITS DESIGN AND EVALUATION , 2003 .

[15]  Ariel D. Procaccia,et al.  Variational Dropout and the Local Reparameterization Trick , 2015, NIPS.

[16]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[17]  Yoshua Bengio,et al.  A Recurrent Latent Variable Model for Sequential Data , 2015, NIPS.

[18]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[19]  Zhe Gan,et al.  Variational Autoencoder for Deep Learning of Images, Labels and Captions , 2016, NIPS.

[20]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[21]  Sanjeev Khudanpur,et al.  Audio augmentation for speech recognition , 2015, INTERSPEECH.

[22]  Jun Du,et al.  Robust speech recognition with speech enhanced deep neural networks , 2014, INTERSPEECH.