Multilingual Spoken Term Detection: Finding and Testing New Pronunciations

When you listen to the evening news, or read a newspaper, book or web site, there is a good chance that you will hear or see a term (perhaps a name, perhaps a technical term) that you have never encountered before. Such words are often novel or rare, and many are names (of people, places, organizations, and so on). They are hard for humans to process, and even harder for automatic speech and language processing systems. Within a single language, a speech recognition or text-to-speech system needs to know how to pronounce a word in order to recognize or say it. Across two languages, in particular a pair with different writing systems, a search engine or document summarizer needs to know how to transcribe a word from one script into the other in order to retrieve or distill information across languages. For example, the soccer player written in English text as Maciej Zurawski would appear as 마치에이주라브스키 (ma-chi-e-i ju-ra-beu-seu-ki) in Korean.

In this project we attack both problems: unusual term pronunciation and term transcription. For pronunciation, we mine the huge numbers of pronunciations that are now available in various forms on the web. These range from the straightforward, such as dictionary sites and Wikipedia entries where people use a fairly strict phonetic transcription system such as IPA, to the difficult, such as:

• Trio Shares Nobel Prize in Medicine – The Nobel is a particularly striking achievement for Capecchi (pronounced kuh-PEK’-ee).

Here we need to look in the vicinity of the name “Capecchi” to find the pronunciation, make use of the cue word “pronounced”, and then interpret the writer’s attempt to render the pronunciation using an English-based ad-hoc “phonetic” orthography. The problem is therefore one of entity extraction, where the entities to extract can be either relatively easy or relatively hard. A relatively easy case is Wikipedia, which uses standard IPA transcriptions that are clearly delimited by markup.
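As a rough illustration of the two extraction cases just described, the sketch below pulls out a parenthesized pronunciation that follows the cue word “pronounced” and flags tokens containing Unicode IPA characters. It is a minimal, pattern-based stand-in and not the statistical entity extractor used in the project; the names `PRONOUNCED_RE`, `find_adhoc_pronunciations` and `has_ipa_characters` are hypothetical, and the IPA check only covers the Unicode IPA Extensions block (U+0250 to U+02AF), which is an assumption about what counts as an IPA token.

```python
import re

# Hypothetical cue pattern: a capitalized term followed by "(pronounced ...)".
PRONOUNCED_RE = re.compile(
    r"(?P<term>[A-Z][\w'’-]*)\s*"      # capitalized token just before the parenthesis
    r"\(\s*pronounced\s+(?:as\s+)?"     # the cue word, optionally "pronounced as"
    r"(?P<pron>[^)]+?)\s*\)"            # everything up to the closing parenthesis
)


def find_adhoc_pronunciations(text):
    """Return (term, pronunciation) pairs for every cue match in the text."""
    return [(m.group("term"), m.group("pron"))
            for m in PRONOUNCED_RE.finditer(text)]


def has_ipa_characters(token):
    """True if the token has a character from the IPA Extensions block.

    Assumption: only U+0250..U+02AF is checked; plain ASCII letters that
    also occur in IPA transcriptions are deliberately ignored.
    """
    return any(0x0250 <= ord(ch) <= 0x02AF for ch in token)


sentence = ("The Nobel is a particularly striking achievement "
            "for Capecchi (pronounced kuh-PEK’-ee).")
print(find_adhoc_pronunciations(sentence))

# "fəˈnɛtɪk" is an illustrative IPA-like token, not data from the project.
print(has_ipa_characters("fəˈnɛtɪk"), has_ipa_characters("Capecchi"))
```

A fixed pattern like this only catches the most regular phrasing; looser contexts are exactly where a trained extractor that also checks letter-to-sound plausibility would be needed.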
On general web pages, tokens with Unicode IPA characters are potential pronunciations. Data extracted from Wikipedia can be matched against these tokens to provide training material for entity extraction. Statistical entity extractors for the more difficult case of ad-hoc phonetic transcriptions (such as “kuh-PEK’-ee” above) can be bootstrapped from unannotated web pages containing patterns such as “pronounced as”. These entity extractors make use of both the textual environment and the letter-to-sound constraints between the candidate pronunciation and its corresponding orthography.

We also use speech data to test possible pronunciation variants by comparing the performance of spoken term detection systems that use these different variants. Pronunciations mined from the web are used to suggest pronunciations for spoken term detection; transcriptions are used to suggest reasonable candidates to search for in a speech stream in another language. We use a novel technique called delayed-decision testing to test candidate pronunciations in speech, and to choose the best one from a set of candidates via a sequential testing procedure, with the associated null hypothesis stating that all candidate pronunciations exhibit the same performance on average. Spoken term detection is in turn used for automatic labeling of the practice data acquired to test this null hypothesis; however, this automatic labeling procedure inevitably induces false alarms as well as correct detections. Delayed-decision testing is then used to choose the correct pronunciation in spite of these false alarms, leading to improved pronunciations for newly identified terms.

For transcription, we use available resources (dictionaries and text corpora) as well as methods for phonetic matching across scripts and for tracking names across time in comparable corpora (such as news sources).
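The abstract does not spell out the delayed-decision testing procedure, so the following is only a generic sketch of the underlying idea: compare two candidate pronunciations on paired per-utterance detection scores, and defer the choice until the evidence against the null hypothesis of equal average performance is strong enough. The function name `sequential_choice`, the z-like statistic, and the `threshold` and `min_samples` values are all assumptions made for illustration, not the authors' method.

```python
import math


def sequential_choice(scores_a, scores_b, threshold=3.0, min_samples=5):
    """Pick candidate "A" or "B" from paired scores, or return None.

    scores_a[i] and scores_b[i] are (possibly noisy) detection scores of
    the two candidate pronunciations on the same i-th utterance. The
    decision is delayed until a z-like statistic on the running mean of
    the score differences reaches the threshold. Returns (decision, n),
    where n is the number of pairs consumed.
    """
    n = 0
    mean = 0.0   # running mean of the differences a - b
    m2 = 0.0     # running sum of squared deviations (Welford's update)
    for a, b in zip(scores_a, scores_b):
        d = a - b
        n += 1
        delta = d - mean
        mean += delta / n
        m2 += delta * (d - mean)
        if n < min_samples:
            continue  # too little data: keep delaying the decision
        var = m2 / (n - 1)
        if var == 0.0:
            # All differences identical: decide only if they are non-zero.
            if mean != 0.0:
                return ("A" if mean > 0 else "B"), n
            continue
        z = mean / math.sqrt(var / n)
        if abs(z) >= threshold:
            return ("A" if z > 0 else "B"), n
    return None, n  # null hypothesis not rejected on the available data


# Toy data: 1 = term detected in the utterance, 0 = missed.
hits_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
hits_b = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(sequential_choice(hits_a, hits_b))
```

Checking a fixed threshold after every new sample inflates the false-rejection rate compared with a single fixed-size test, so a real sequential procedure would need properly calibrated stopping boundaries; the constant used here is purely illustrative.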
