Semantic indexing and document retrieval for personalized language modeling

This paper presents a semantic indexing and document retrieval approach to personalized language modeling, aimed at improving speech recognition accuracy for individual speakers in a lecture transcription task. Latent semantic indexing and paragraph vector modeling are used to retrieve, from an existing background corpus, a subset of documents relevant to the topic and speaking style of a speaker. Specifically, we select the text documents that are semantically similar to the output hypotheses produced for the recognized speech segments in the first decoding stage. A small user- and topic-specific language model is then built from the relevant documents and interpolated with the background model; the resulting model, adapted to the current topic, is applied in the second decoding stage. Experiments on ten speakers from the Slovak TEDx talks database show a relative word error rate improvement of up to 3.03% on average.
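The two-stage adaptation described above can be illustrated with a minimal sketch. It is not the paper's implementation: plain term-count cosine similarity stands in for latent semantic indexing and paragraph vectors, unigram models stand in for the n-gram models a recognizer would use, and the function names, toy corpus, and interpolation weight `lam` are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_documents(hypothesis, corpus, top_k=2):
    """Return the top_k corpus documents most similar to the first-pass hypothesis.

    Assumption: raw term counts replace the LSI / paragraph-vector
    representations named in the abstract.
    """
    query = Counter(hypothesis.split())
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(query, Counter(doc.split())),
        reverse=True,
    )
    return ranked[:top_k]


def unigram_model(docs):
    """Maximum-likelihood unigram model (a stand-in for a real n-gram model)."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(topic, background, lam=0.5):
    """Linear interpolation: lam * P_topic(w) + (1 - lam) * P_background(w)."""
    vocab = set(topic) | set(background)
    return {
        w: lam * topic.get(w, 0.0) + (1 - lam) * background.get(w, 0.0)
        for w in vocab
    }


# Toy background corpus and first-pass hypothesis (illustrative data only).
background_corpus = [
    "neural networks learn speech features",
    "the parliament voted on the budget",
    "speech recognition uses language models",
    "football match ended in a draw",
]
hypothesis = "speech recognition with neural networks"

selected = select_documents(hypothesis, background_corpus, top_k=2)
topic_lm = unigram_model(selected)
background_lm = unigram_model(background_corpus)
adapted_lm = interpolate(topic_lm, background_lm, lam=0.3)
```

In this sketch the adapted model shifts probability mass toward words that occur in the retrieved documents while remaining a proper distribution, which is the effect the second decoding stage relies on. In practice the weight `lam` would be tuned on held-out data rather than fixed by hand.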
