Distributed representation-based spoken word sense induction

Spoken Term Detection (STD) or Keyword Search (KWS) techniques can locate keyword instances but do not differentiate between meanings. Spoken Word Sense Induction (SWSI) differentiates target instances by clustering according to context, providing a more useful result. In this paper we present a fully unsupervised SWSI approach based on distributed representations of spoken utterances. We compare this approach to several others, including the state-of-the-art Hierarchical Dirichlet Process (HDP). To determine how ASR performance affects SWSI, we used three different levels of Word Error Rate (WER), 40%, 20% and 0%; 40% WER is representative of online video, 0% of text. We show that the distributed representation approach outperforms all other approaches, regardless of the WER. Although LDA-based approaches do well on clean data, they degrade significantly with WER. Paradoxically, lower WER does not guarantee better SWSI performance, due to the influence of common locutions.

[1]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[2]  Florian Metze,et al.  Distributed learning of multilingual DNN feature extractors using GPUs , 2014, INTERSPEECH.

[3]  Ted Pedersen Duluth : Word Sense Induction Applied to Web Page Clustering , 2013, SemEval@NAACL-HLT.

[4]  Florian Metze,et al.  Towards speaker adaptive training of deep neural network acoustic models , 2014, INTERSPEECH.

[5]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[6]  Jonathan G. Fiscus,et al.  Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[7]  P. Jaccard,et al.  Etude comparative de la distribution florale dans une portion des Alpes et des Jura , 1901 .

[8]  Hank Liao,et al.  Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[9]  Ganesh Ramakrishnan,et al.  SATTY : Word Sense Induction Application in Web Search Clustering , 2013, SemEval@NAACL-HLT.

[10]  Patrick Pantel,et al.  Discovering word senses from text , 2002, KDD.

[11]  Roberto Navigli,et al.  Word sense disambiguation: A survey , 2009, CSUR.

[12]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[13]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[14]  Roberto Navigli,et al.  SemEval-2013 Task 11: Word Sense Induction and Disambiguation within an End-User Application , 2013, SemEval@NAACL-HLT.

[15]  Alexander I. Rudnicky,et al.  Using conversational word bursts in spoken term detection , 2013, INTERSPEECH.

[16]  Florian Metze,et al.  Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training , 2013, INTERSPEECH.

[17]  Florian Metze,et al.  Deep maxout networks for low-resource speech recognition , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[18]  Yoshua Bengio,et al.  Word Representations: A Simple and General Method for Semi-Supervised Learning , 2010, ACL.

[19]  Timothy Baldwin,et al.  unimelb: Topic Modelling-based Word Sense Induction for Web Snippet Clustering , 2013, SemEval@NAACL-HLT.

[20]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[21]  Danqi Chen,et al.  Reasoning With Neural Tensor Networks for Knowledge Base Completion , 2013, NIPS.

[22]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[23]  Christiane Fellbaum,et al.  Augmenting WordNet for Deep Understanding of Text , 2008, STEP.

[24]  George Karypis,et al.  Evaluation of hierarchical clustering algorithms for document datasets , 2002, CIKM '02.

[25]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[26]  L. Hubert,et al.  Comparing partitions , 1985 .

[27]  Geoffrey E. Hinton,et al.  A Scalable Hierarchical Distributed Language Model , 2008, NIPS.

[28]  Alexander G. Hauptmann,et al.  Instructional Videos for Unsupervised Harvesting and Learning of Action Examples , 2014, ACM Multimedia.