Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian languages

This paper presents how we adapted a website search engine for cross-language information retrieval, using the Uplug word alignment tool for parallel corpora. We first studied the monolingual search queries posed by visitors to the website of the Nordic Council, which contains five different languages. To compare how well different types of bilingual dictionaries covered the most common queries and terms on the website, we tried a collection of ordinary bilingual dictionaries, a small manually constructed trilingual dictionary, and a trilingual dictionary constructed automatically with Uplug from the news corpus on the website. The precision and recall of the automatically constructed Swedish-English dictionary were 71 and 93 percent, respectively. We found that precision and recall increase significantly in samples with high word frequency, but we could not confirm that POS tags improve precision. The collection of ordinary dictionaries, consisting of about 200 000 words, covers only 41 of the top 100 search queries at the website. The automatically built trilingual dictionary combined with the small manually built trilingual dictionary, consisting of about 2 300 words, covers 36 of the top search queries.
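The coverage figures above (for example, 41 of the top 100 queries) can be illustrated with a minimal sketch. It assumes that a query counts as covered when it matches a dictionary entry exactly after lowercasing; the matching rule actually used in the study is not stated here and may differ. The function name `query_coverage` and the toy data are hypothetical.

```python
def query_coverage(top_queries, dictionary_entries):
    """Return the covered queries and the total number of queries.

    Assumption: a query is "covered" if it equals a dictionary entry
    after trimming whitespace and lowercasing (no stemming or
    compound splitting).
    """
    entries = {entry.strip().lower() for entry in dictionary_entries}
    covered = [q for q in top_queries if q.strip().lower() in entries]
    return covered, len(top_queries)


# Hypothetical toy data, not taken from the study.
top_queries = ["Stipendier", "nordjobb", "miljö", "kultur"]
dictionary_entries = ["stipendier", "miljö", "språk"]

covered, total = query_coverage(top_queries, dictionary_entries)
print(f"{len(covered)} of {total} queries covered")
```

A stricter or looser matching rule (for instance, lemmatizing both sides) would change the counts, so the comparison between dictionaries is only meaningful when the same rule is applied to each.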
