Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles

Wikipedia articles in different languages have been mined to support various tasks, such as Cross-Language Information Retrieval (CLIR) and Statistical Machine Translation (SMT). Articles on the same topic in different languages are often connected by inter-language links, which can be used to identify similar or comparable content. In this work, we investigate how well similarity measures based on language-independent and language-dependent features correlate with human judgments of similarity. A collection of 800 Wikipedia article pairs covering 8 language pairs was assembled, and each pair was judged for similarity by two assessors. We report the development of this corpus and the inter-assessor agreement across the languages. Results show that similarity measured using language-independent features is comparable to that obtained by an approach based on translating the non-English documents. In both cases the correlation with human judgments is low and varies with the language pair. The results and the corpus generated from this work also provide insights into the measurement of cross-language similarity.
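
To make the evaluation idea concrete, the following is a minimal sketch of how an automatic similarity measure could be correlated with human judgments for one language pair. It assumes a rank correlation (Spearman's rho); the abstract does not state which coefficient was used. The function names, the score lists, and the judgment scale are hypothetical and serve only as illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data for a handful of article pairs in one language pair:
# automatic similarity scores and the mean of two assessors' judgments.
measure_scores = [0.12, 0.40, 0.35, 0.80, 0.05]
human_judgments = [2.0, 3.5, 2.5, 4.5, 1.0]

rho = spearman(measure_scores, human_judgments)
print(f"Spearman correlation: {rho:.3f}")
```

Computing the coefficient separately for each language pair, rather than over the pooled collection, is what would expose the language-pair dependence described above.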
