Confusion2Vec: towards enriching vector space word representations with representational ambiguities

Word vector representations are a crucial part of Natural Language Processing (NLP) and Human Computer Interaction. In this paper, we propose a novel word vector representation, Confusion2Vec, motivated by human speech production and perception, that encodes representational ambiguity. Humans employ both acoustic similarity cues and contextual cues to decode information, and we focus on a model that incorporates both sources of information. The representational ambiguity of acoustics, which manifests itself in word confusions, is often resolved by both humans and machines through contextual cues. Beyond acoustic perception, a range of representational ambiguities can emerge in various domains, such as morphological transformations and paraphrasing, in NLP tasks like machine translation. In this work, we present a case study applied to Automatic Speech Recognition (ASR), where the word confusions are related to acoustic similarity. We present several techniques for training a representation that captures the ambiguity arising from acoustic perceptual similarity. We term this representation Confusion2Vec and learn it from data generated without supervision from ASR confusion networks or lattice-like structures. Appropriate evaluations for Confusion2Vec are formulated to gauge acoustic similarity, in addition to semantic-syntactic and word similarity evaluations. Confusion2Vec is able to model word confusions efficiently without compromising the semantic-syntactic word relations, thus effectively enriching the word vector space with extra task-relevant ambiguity information. We provide an intuitive exploration of the 2-dimensional Confusion2Vec space using Principal Component Analysis of the embedding and relate it to semantic, syntactic and acoustic relationships. The potential of Confusion2Vec for exploiting the uncertainty present in lattices is demonstrated through small examples relating to ASR error correction.
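
As a rough illustration of how a confusion network could supply both kinds of cue, the sketch below derives skip-gram-style training pairs from a toy network. It is not the training procedure of this work: the function name `confusion_pairs`, the representation of a network as a list of slots of competing words, the `window` parameter, and the pair labels are all assumptions made for this example.

```python
from itertools import permutations


def confusion_pairs(network, window=1):
    """Return (target, context, kind) pairs from a toy confusion network.

    `network` is a list of slots; each slot is a list of words that compete
    for the same position in the utterance (hypothetical representation).
    """
    pairs = []
    for t, slot in enumerate(network):
        # Words competing within one slot are treated as confusable.
        for a, b in permutations(slot, 2):
            pairs.append((a, b, "confusion"))
        # Words in neighbouring slots act as ordinary context,
        # in the spirit of skip-gram.
        lo = max(0, t - window)
        hi = min(len(network), t + window + 1)
        for u in range(lo, hi):
            if u == t:
                continue
            for a in slot:
                for b in network[u]:
                    pairs.append((a, b, "context"))
    return pairs


# Toy network: each inner list holds the alternatives for one time slot.
network = [["i"], ["want", "won't"], ["to", "two", "too"], ["sit", "seat"]]
pairs = confusion_pairs(network)
```

Feeding both pair types to a word2vec-style objective is one way a single vector space could reflect contextual and confusability information together; how the two are weighted or combined is left open here.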
