Audio Word2vec: Sequence-to-Sequence Autoencoding for Unsupervised Learning of Audio Segmentation and Representation

In text, word2vec transforms each word into a fixed-size vector used as the basic component in applications of natural language processing. Given a large collection of unannotated audio, audio word2vec can also be trained in an unsupervised way using a sequence-to-sequence autoencoder (SA). These vector representations are shown to effectively describe the sequential phonetic structures of the audio segments. In this paper, we further extend this research in the following two directions. First, we disentangle phonetic information and speaker information from the SA vector representations. Second, we extend audio word2vec from the word level to the utterance level by proposing a new segmental audio word2vec in which unsupervised spoken word boundary segmentation and audio word2vec are jointly learned and mutually enhanced, and utterances are directly represented as sequences of vectors carrying phonetic information. This is achieved by means of a segmental sequence-to-sequence autoencoder, in which a segmentation gate trained with reinforcement learning is inserted in the encoder.

[1]  Yu Zhang,et al.  Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[2]  Brian Kingsbury,et al.  End-to-end ASR-free keyword search from speech , 2017, ICASSP.

[3]  Tanja Schultz,et al.  Globalphone: a multilingual speech and text database developed at karlsruhe university , 2002, INTERSPEECH.

[4]  Bin Ma,et al.  An acoustic segment modeling approach to query-by-example spoken term detection , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Hugo Van hamme,et al.  Fast word acquisition in an NMF-based learning framework , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Lin-Shan Lee,et al.  Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-to-Sequence Autoencoder , 2016, INTERSPEECH.

[7]  Yifan Gong,et al.  Unsupervised adaptation with domain separation networks for robust speech recognition , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[8]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[9]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[10]  Lin-Shan Lee,et al.  Phonetic-and-Semantic Embedding of Spoken words with Applications in Spoken Content Retrieval , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[11]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[12]  Karen Livescu,et al.  Multi-view Recurrent Neural Acoustic Word Embeddings , 2016, ICLR.

[13]  Yu Zhang,et al.  Learning Latent Representations for Speech Generation and Transformation , 2017, INTERSPEECH.

[14]  Kenneth Ward Church,et al.  Towards spoken term discovery at scale with zero resources , 2010, INTERSPEECH.

[15]  James R. Glass,et al.  A Nonparametric Bayesian Approach to Acoustic Model Discovery , 2012, ACL.

[16]  Andrew L. Maas,et al.  Word-level Acoustic Modeling with Convolutional Vector Regression , 2012 .

[17]  Aren Jansen,et al.  Exploiting Discriminative Point Process Models for Spoken Term Detection , 2012, INTERSPEECH.

[18]  Adam Lopez,et al.  Towards speech-to-text translation without speech recognition , 2017, EACL.

[19]  Daniel Jurafsky,et al.  A Hierarchical Neural Autoencoder for Paragraphs and Documents , 2015, ACL.

[20]  Erhardt Barth,et al.  A Hybrid Convolutional Variational Autoencoder for Text Generation , 2017, EMNLP.

[21]  Hung-yi Lee,et al.  Gate Activation Signal Analysis for Gated Recurrent Neural Networks and its Correlation with Phoneme Boundaries , 2017, INTERSPEECH.

[22]  Karen Livescu,et al.  Deep convolutional acoustic word embeddings using word-pair side information , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  James R. Glass,et al.  Unsupervised Pattern Discovery in Speech , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[24]  Haizhou Li,et al.  An Iterative Approach to Model Merging for Speech Pattern Discovery , 2011 .

[25]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[26]  Patrick Kenny,et al.  Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification , 2009, INTERSPEECH.

[27]  Aren Jansen,et al.  Towards Learning Semantic Audio Representations from Unlabeled Data , 2017 .

[28]  Aren Jansen,et al.  The zero resource speech challenge 2017 , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[29]  Yu Zhang,et al.  Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data , 2017, NIPS.

[30]  Hung-yi Lee,et al.  Language Transfer of Audio Word2Vec: Learning Audio Segment Representations Without Target Language Data , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Aren Jansen,et al.  Segmental acoustic indexing for zero resource keyword search , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  Lin-Shan Lee,et al.  Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33]  Karen Livescu,et al.  An embedded segmental K-means model for unsupervised segmentation and clustering of speech , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[34]  Tara N. Sainath,et al.  Query-by-example keyword spotting using long short-term memory networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Aren Jansen,et al.  Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[36]  Majid Mirbagheri,et al.  ASR for Under-Resourced Languages From Probabilistic Transcription , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[37]  Karen Livescu,et al.  Query-by-Example Search with Discriminative Neural Acoustic Word Embeddings , 2017, INTERSPEECH.

[38]  Herman Kamper,et al.  Truly Unsupervised Acoustic Word Embeddings Using Weak Top-down Constraints in Encoder-decoder Models , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Okko Johannes Räsänen,et al.  Basic cuts revisited: Temporal segmentation of speech into phone-like units with statistical learning at a pre-linguistic level , 2014, CogSci.

[40]  Balaraman Ravindran,et al.  Diversity driven attention model for query-based abstractive summarization , 2017, ACL.

[41]  Aren Jansen,et al.  Towards Unsupervised Training of Speaker Independent Acoustic Models , 2011, INTERSPEECH.

[42]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[43]  Nobuaki Minematsu,et al.  Unsupervised optimal phoneme segmentation: Objectives, algorithm and comparisons , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[44]  Alec Radford,et al.  Proximal Policy Optimization Algorithms , 2017, ArXiv.

[45]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[46]  Yoshua Bengio,et al.  Hierarchical Multiscale Recurrent Neural Networks , 2016, ICLR.

[47]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[48]  Aren Jansen,et al.  A segmental framework for fully-unsupervised large-vocabulary speech recognition , 2016, Comput. Speech Lang..

[49]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[50]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[51]  Lin-Shan Lee,et al.  Unsupervised Discovery of Structured Acoustic Tokens With Applications to Spoken Term Detection , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[52]  Léon Bottou,et al.  Wasserstein GAN , 2017, ArXiv.

[53]  Emmanuel Dupoux,et al.  Learning Word Embeddings: Unsupervised Methods for Fixed-size Representations of Variable-length Speech Segments , 2018, INTERSPEECH.

[54]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[55]  Lin-Shan Lee,et al.  Unsupervised spoken term detection with spoken queries by multi-level acoustic patterns with varying model granularity , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[56]  Karen Livescu,et al.  Discriminative acoustic word embeddings: Tecurrent neural network-based approaches , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[57]  Frank K. Soong,et al.  A segment model based approach to speech recognition , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[58]  Boon Pang Lim,et al.  Multitask Learning for Phone Recognition of Underresourced Languages Using Mismatched Transcription , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[59]  Pierre Baldi,et al.  Autoencoders, Unsupervised Learning, and Deep Architectures , 2011, ICML Unsupervised and Transfer Learning.

[60]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[61]  François Laviolette,et al.  Domain-Adversarial Training of Neural Networks , 2015, J. Mach. Learn. Res..

[62]  I-Fan Chen,et al.  A hybrid HMM/DNN approach to keyword spotting of short words , 2013, INTERSPEECH.

[63]  Lin-Shan Lee,et al.  Enhanced Spoken Term Detection Using Support Vector Machines and Weighted Pseudo Examples , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[64]  James R. Glass,et al.  Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech , 2018, INTERSPEECH.

[65]  Aaron C. Courville,et al.  Improved Training of Wasserstein GANs , 2017, NIPS.

[66]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[67]  Micha Elsner,et al.  Speech segmentation with a neural encoder model of working memory , 2017, EMNLP.

[68]  James R. Glass,et al.  Towards multi-speaker unsupervised speech pattern discovery , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[69]  Georg Heigold,et al.  Word embeddings for speech recognition , 2014, INTERSPEECH.

[70]  Giampiero Salvi,et al.  Word Discovery with Beta Process Factor Analysis , 2012, INTERSPEECH.

[71]  Aren Jansen,et al.  Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[72]  Kenneth Ward Church,et al.  A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[73]  Herbert Gish,et al.  Keyword Spotting of Arbitrary Words Using Minimal Speech Resources , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[74]  Yoshua Bengio,et al.  Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation , 2013, ArXiv.

[75]  Hugo Van hamme,et al.  Discovering Phone Patterns in Spoken Utterances by Non-Negative Matrix Factorization , 2008, IEEE Signal Processing Letters.

[76]  Nicolas Usunier,et al.  Joint Learning of Speaker and Phonetic Similarities with Siamese Networks , 2016, INTERSPEECH.

[77]  James R. Glass,et al.  Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.