Speech retrieval from unsegmented Finnish audio using statistical morpheme-like units for segmentation, recognition, and retrieval

This article examines the use of statistically discovered morpheme-like units for Spoken Document Retrieval (SDR). The morpheme-like units (morphs) are used both for language modeling in speech recognition and as index terms. Traditional word-based methods suffer from out-of-vocabulary words: if a word is not in the recognizer vocabulary, every occurrence of that word in speech will be missing from the transcripts. The problem is especially severe for languages with a high number of distinct word forms, such as Finnish. With the morph language model, even previously unseen words can be recognized by identifying their component morphs. Similarly, in information retrieval queries, complex word forms, even unseen ones, can be matched to data after segmenting them into morphs. Retrieval performance can be further improved by expanding the transcripts with alternative recognition results from confusion networks. In this article, a novel retrieval evaluation corpus consisting of unsegmented Finnish radio programs, 25 queries, and corresponding human relevance assessments was constructed. Previous results on using morphs and confusion networks for Finnish SDR are confirmed and extended to the unsegmented case. As previously, using morphs or base forms as index terms yields about equal performance, but combination methods, including a new one, are found to work better than either alone. Using alternative morph segmentations of the query words is found to further improve the results. Lexical similarity-based story segmentation was applied, and performance using morphs, base forms, and their combinations was compared for the first time.
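The idea of matching unseen word forms through shared morphs can be illustrated with a minimal sketch. It is not the article's system: the toy lexicon `MORPHS`, the greedy longest-match `segment` function, and the raw-count `score` are all assumptions made for illustration, whereas the article relies on statistically discovered morphs and a full retrieval engine.

```python
from collections import Counter

# Hypothetical toy morph lexicon (assumption); a real system would learn
# the morphs statistically from a text corpus.
MORPHS = {"talo", "auto", "ssa", "i", "sta"}


def segment(word, lexicon=MORPHS):
    """Greedy longest-match split of a word into known morphs.

    Characters not covered by any morph become single-character units.
    """
    morphs, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                morphs.append(word[i:j])
                i = j
                break
        else:
            # No known morph starts here: fall back to one character.
            morphs.append(word[i])
            i += 1
    return morphs


def index_terms(text):
    """Count morph index terms for a transcript."""
    return Counter(m for w in text.lower().split() for m in segment(w))


def score(query, doc_terms):
    """Sum the document counts of every morph in the query."""
    return sum(doc_terms[m] for w in query.lower().split() for m in segment(w))


doc = index_terms("talossa")      # "in the house"
print(segment("taloista"))        # "from the houses", a form absent from the transcript
print(score("taloista", doc))     # nonzero because both share the stem morph "talo"
```

Exact word matching would give this query no hits, while the shared stem morph still links it to the transcript. Unweighted counts would also let frequent suffix morphs such as "ssa" match unrelated documents, so a practical system would apply term weighting on top of this.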
