Multi-scale document expansion in English-Mandarin cross-language spoken document retrieval

This paper applies document expansion using a side collection to a cross-language spoken document retrieval (CL-SDR) task in order to improve retrieval performance. Document expansion is applied in a series of English-Mandarin CL-SDR experiments using selected retrieval models (a probabilistic belief network, a vector space model, and an HMM-based retrieval model). English textual queries are used to retrieve relevant documents from an archive of Mandarin radio broadcast news. We have devised a multi-scale approach to document expansion, a process that enriches the Mandarin spoken document collection in order to improve overall retrieval performance. A document is expanded by (i) retrieving related documents on a character bigram scale, (ii) extracting word units from these related documents as expansion terms to augment the original document, and (iii) indexing all documents in the collection by character bigrams, with the expansion terms indexed by their within-word character bigrams, to prepare for future retrieval. The approach is therefore multi-scale in that it involves both word and subword scales. Experimental results show that this approach achieves performance improvements of up to 14% across several retrieval models.
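The three expansion steps can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `top_docs` and `top_terms` parameters, the use of cosine similarity over bigram counts for step (i), and frequency-based selection of expansion words in step (ii) are all assumptions made for illustration. The side collection is assumed to be already word-segmented.

```python
from collections import Counter
from math import sqrt


def char_bigrams(text):
    """Return overlapping character bigrams of a string, whitespace removed."""
    chars = "".join(text.split())
    return [chars[i:i + 2] for i in range(len(chars) - 1)]


def cosine(a, b):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_document(doc_text, side_collection, top_docs=2, top_terms=3):
    """Expand one document using a word-segmented side collection.

    side_collection is a list of documents, each a list of words
    (assumed pre-segmented; the segmentation method is not shown here).
    """
    doc_vec = Counter(char_bigrams(doc_text))

    # (i) Retrieve related side documents on the character-bigram scale.
    # Cosine similarity is a stand-in for the retrieval models in the paper.
    scored = [
        (cosine(doc_vec, Counter(char_bigrams("".join(words)))), words)
        for words in side_collection
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    related = [words for score, words in scored[:top_docs] if score > 0]

    # (ii) Extract word units from the related documents as expansion terms.
    # Selecting the most frequent multi-character words is an assumption.
    word_counts = Counter(w for words in related for w in words if len(w) >= 2)
    expansion_terms = [w for w, _ in word_counts.most_common(top_terms)]

    # (iii) Index the document by its character bigrams, and the expansion
    # terms by within-word bigrams only (no bigrams spanning two words).
    index_terms = Counter(char_bigrams(doc_text))
    for word in expansion_terms:
        index_terms.update(char_bigrams(word))
    return expansion_terms, index_terms


if __name__ == "__main__":
    side = [["香港", "天氣", "報告"], ["股票", "市場"]]
    terms, index = expand_document("香港天氣", side)
    print(terms)
    print(index)
```

Restricting the expansion terms to within-word bigrams means that a bigram such as 氣報, which straddles the boundary between 天氣 and 報告, never enters the index, while the document's own text is still indexed by all of its overlapping bigrams.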
