Dutch speech recognition in multimedia information retrieval

As data storage capacities grow to nearly unlimited sizes thanks to ongoing hardware and software improvements, an increasing amount of information is being stored in multimedia and spoken-word collections. Assuming that data is stored in order to use (portions of) it at some later time, these collections must also be searchable. For multimedia and spoken-word collections, traditional text-oriented information retrieval (IR) strategies fall short, as the amount of textual information included with these types of documents is usually very limited. However, when automatic speech recognition (ASR) is used to convert the speech in these documents into text, the resulting textual representations can be searched with traditional text-based search strategies. Because ASR systems label recognized words with exact time information as a matter of course, detailed searching within multimedia and spoken-word collections becomes possible. This type of retrieval is commonly referred to as Spoken Document Retrieval (SDR).

Typically, large vocabulary speaker-independent continuous speech recognition (LVCSR) systems are deployed to create textual representations of the spoken audio in multimedia and spoken-word collections. For Dutch, however, no such system was available when this research started. As creating a Dutch system from scratch was not feasible given the available resources, an existing English system, referred to as the ABBOT system, was ported to Dutch. A significant part of this thesis is dedicated to a complete rundown of the porting work, which involved collecting and preparing suitable training data and training and evaluating the acoustic models and language models. The broadcast news domain was chosen as the domain of focus, as it has been used extensively as a benchmark domain for both international ASR research and SDR.
A complicating factor for ASR in the news domain is that word usage is highly variable. As a consequence, besides using large vocabularies, it is important to adjust these vocabularies regularly, so that they reflect the content of the news programs well. This thesis therefore investigates which word selection strategies are best suited for making these vocabulary adjustments. Moreover, as dynamic vocabularies require flexible generation of accurate word pronunciations, the development of a grapheme-to-phoneme converter is addressed. Another vocabulary-related issue that is investigated stems from a well-known characteristic of the Dutch language, word compounding: Dutch words can almost freely be joined together to form new words. As a result, the number of distinct words in Dutch is relatively large, which reduces the coverage of Dutch vocabularies compared to vocabularies of the same size in languages without word compounding, such as English. This thesis investigates whether splitting Dutch compound words can remedy this relatively limited vocabulary coverage and thereby improve ASR performance.

Next to a brief history of SDR research and a review of possible SDR approaches, this thesis demonstrates the use of a Dutch LVCSR system in SDR by providing an illustrative example of an SDR evaluation on a collection of Dutch broadcast news shows. It is shown that Dutch speech recognition can successfully be deployed for content-based retrieval of broadcast news programs. The experience obtained with the research described in this thesis, and the experience that will emerge from future research efforts, should contribute to the long-term accessibility of the increasing amount of information being stored in Dutch multimedia and spoken-word collections.
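
The effect of compound splitting on vocabulary coverage can be illustrated with a small sketch. It is not the method used in this thesis: the function names, the toy word lists, and the greedy two-part splitter are assumptions made for illustration only, and the splitter ignores Dutch linking morphemes such as -s- and -en-. It measures the out-of-vocabulary (OOV) rate of a test text before and after splitting compounds into parts that are already in the vocabulary.

```python
from collections import Counter


def build_vocabulary(training_tokens, size):
    """Return the `size` most frequent word types in the training text."""
    return {word for word, _ in Counter(training_tokens).most_common(size)}


def split_compound(word, vocabulary, min_part=3):
    """Toy splitter (illustrative assumption): split an unknown word into
    two known parts, trying the shortest head first. No linking morphemes."""
    if word in vocabulary:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        if head in vocabulary and tail in vocabulary:
            return [head, tail]
    return [word]


def oov_rate(tokens, vocabulary):
    """Fraction of tokens that are not covered by the vocabulary."""
    if not tokens:
        return 0.0
    return sum(token not in vocabulary for token in tokens) / len(tokens)


# Hypothetical toy data, not taken from the thesis corpora.
train = ["het", "voetbal", "de", "wedstrijd", "het", "nieuws", "de", "dag"]
test = ["de", "voetbalwedstrijd", "het", "nieuws"]

vocab = build_vocabulary(train, size=10)
split_test = [part for token in test for part in split_compound(token, vocab)]

print(oov_rate(test, vocab))        # 0.25: "voetbalwedstrijd" is unknown
print(oov_rate(split_test, vocab))  # 0.0: "voetbal" + "wedstrijd" are known
```

In this toy case the compound is the only uncovered token, so splitting it removes all OOV tokens. On real data the gain depends on how reliably compounds can be split and on how the split parts are handled in the language model and in recombination.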
