Paper information - Multi-view Recurrent Neural Acoustic Word Embeddings

Multi-view Recurrent Neural Acoustic Word Embeddings

Recent work has begun exploring neural acoustic word embeddings: fixed-dimensional vector representations of arbitrary-length speech segments corresponding to words. Such embeddings are applicable to speech retrieval and recognition tasks, where reasoning about whole words may make it possible to avoid ambiguous sub-word representations. The main idea is to map acoustic sequences to fixed-dimensional vectors such that examples of the same word are mapped to similar vectors, while different-word examples are mapped to very different vectors. In this work, we take a multi-view approach to learning acoustic word embeddings, in which we jointly learn to embed acoustic sequences and their corresponding character sequences. We use deep bidirectional LSTM embedding models and multi-view contrastive losses. We study the effect of different loss variants, including fixed-margin and cost-sensitive losses. Our acoustic word embeddings improve over previous approaches for the task of word discrimination. We also present results on other tasks that are enabled by the multi-view approach, including cross-view word discrimination and word similarity.
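To make the idea of a fixed-margin multi-view contrastive loss concrete, here is a minimal sketch. The abstract does not give the exact form of the loss, so everything below is an assumption for illustration: the hinge-style formulation, the use of cosine distance, the default margin of 0.5, and the names `cosine_distance` and `multiview_margin_loss` are all hypothetical. The embeddings are taken as given; in the paper they would come from the bidirectional LSTM models for the acoustic and character views.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def multiview_margin_loss(acoustic_emb, char_emb_same, char_emb_diff, margin=0.5):
    """Hinge loss on one (acoustic, matching word, non-matching word) triple.

    Assumed form: the acoustic embedding should be closer to the character
    embedding of its own word than to that of a different word, by at least
    `margin`. The loss is zero once that gap is reached.
    """
    d_pos = cosine_distance(acoustic_emb, char_emb_same)
    d_neg = cosine_distance(acoustic_emb, char_emb_diff)
    return max(0.0, margin + d_pos - d_neg)


# Toy 2-dimensional embeddings (made up for illustration).
acoustic = [1.0, 0.0]
same_word = [1.0, 0.0]
other_word = [0.0, 1.0]

well_separated = multiview_margin_loss(acoustic, same_word, other_word)
badly_separated = multiview_margin_loss(acoustic, other_word, same_word)
```

In this toy case the well-separated triple incurs no loss, while the swapped triple is penalized. A cost-sensitive variant, as mentioned in the abstract, would presumably replace the constant `margin` with a value that depends on how different the two words are; that detail is not specified here.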
