Paper Information - Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Recurrent Neural Networks

Many speech applications, including speaker identification, audio emotion classification, and spoken term detection (STD), need to represent audio segments, which are expressed as variable-length acoustic feature sequences, as fixed-length feature vectors. In this paper, we apply and extend the sequence-to-sequence learning framework to learn representations for audio segments without any supervision. The model we use is called the Sequence-to-sequence Autoencoder (SA). It consists of two RNNs equipped with Long Short-Term Memory (LSTM) units: the first RNN acts as an encoder that maps the input sequence into a vector representation of fixed dimensionality, and the second RNN acts as a decoder that maps the representation back to the input sequence. The two RNNs are jointly trained by minimizing the reconstruction error. We further propose the Denoising Sequence-to-sequence Autoencoder (DSA), which improves the learned representations. The vector representations learned by SA and DSA are shown to be very helpful for query-by-example STD. The experimental results show that the proposed models achieve better retrieval performance than both heuristically designed audio segment representations and the classical Dynamic Time Warping (DTW) approach.
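To make the comparison in the abstract concrete, the following is a minimal sketch of the two retrieval styles it contrasts for query-by-example STD: classical DTW applied directly to variable-length feature sequences, and cosine similarity applied to fixed-length vectors. It is not the paper's model. The function names are hypothetical, and `mean_pool` is only an assumed placeholder for an encoder (a simple frame average); in the paper the fixed-length vector would come from the trained SA or DSA encoder instead.

```python
import math


def dtw_distance(a, b):
    """Classical DTW cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


def mean_pool(segment):
    """Placeholder encoder (assumption): average the frames of a segment.

    Stands in for a learned encoder that maps a variable-length
    sequence to one fixed-length vector.
    """
    dim = len(segment[0])
    return [sum(frame[k] for frame in segment) / len(segment)
            for k in range(dim)]


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_by_vector(query, segments, encode=mean_pool):
    """Rank segment indices by cosine similarity of fixed-length vectors."""
    q = encode(query)
    scores = [cosine_similarity(q, encode(s)) for s in segments]
    return sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)


def rank_by_dtw(query, segments):
    """Rank segment indices by DTW cost (lower cost ranks first)."""
    dists = [dtw_distance(query, s) for s in segments]
    return sorted(range(len(segments)), key=lambda i: dists[i])


# Toy 2-dimensional "acoustic features"; real inputs would be e.g. MFCC frames.
query = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
segments = [
    [[0.0, 1.0], [0.1, 1.0]],                          # unrelated segment
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.1], [0.9, 0.0]],  # time-stretched query
]
vector_ranking = rank_by_vector(query, segments)
dtw_ranking = rank_by_dtw(query, segments)
```

In this sketch the vector-based ranking needs one encoding per segment, which can be computed offline, plus a cheap similarity per query, whereas DTW fills a full alignment table for every query and segment pair. That difference in cost is one practical reason fixed-length representations are attractive for retrieval.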
