MultiWiki

In this article, we address the problem of text passage alignment across interlingual article pairs in Wikipedia. We develop methods that identify and interlink text passages that are written in different languages and contain overlapping information. Interlingual text passage alignment can help Wikipedia editors and readers better understand the language-specific context of entities, provide valuable insights into cultural differences, and build a basis for qualitative analysis of the articles. An important challenge in this context is the tradeoff between the granularity of the extracted text passages and the precision of the alignment: whereas short text passages can result in more precise alignment, longer text passages can provide a better overview of the differences in an article pair. To better understand these aspects from the user perspective, we conduct a user study using the German, Russian, and English Wikipedia as examples and collect a user-annotated benchmark. We then propose MultiWiki, a method that takes an integrated approach to text passage alignment, using semantic similarity measures and greedy algorithms, and achieves precise results with respect to the user-defined alignment. The MultiWiki demonstration is publicly available and currently supports four language pairs.
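
The abstract names semantic similarity measures and greedy algorithms without specifying them, so the following is only a minimal sketch of the general idea, not the MultiWiki method itself. It assumes each passage has already been reduced to a set of language-independent features (here, hypothetical entity identifiers), uses Jaccard overlap as a stand-in similarity measure, and greedily pairs the most similar passages one-to-one. The function names, the feature representation, and the threshold value are all illustrative assumptions.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two feature sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_align(
    left: List[Set[str]],
    right: List[Set[str]],
    threshold: float = 0.3,  # assumed cut-off, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Greedily pair passages from two articles by descending similarity.

    Each passage is used at most once. Returns (left_index, right_index,
    score) triples in the order they were selected.
    """
    # Score every cross-article passage pair.
    candidates = [
        (jaccard(l, r), i, j)
        for i, l in enumerate(left)
        for j, r in enumerate(right)
    ]
    # Highest score first; indices break ties deterministically.
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))

    used_left: Set[int] = set()
    used_right: Set[int] = set()
    pairs: List[Tuple[int, int, float]] = []
    for score, i, j in candidates:
        if score < threshold:
            break  # all remaining candidates are below the cut-off
        if i in used_left or j in used_right:
            continue  # one side is already aligned
        used_left.add(i)
        used_right.add(j)
        pairs.append((i, j, score))
    return pairs


# Hypothetical passages from two language editions, as sets of entity IDs.
german = [{"Q64", "Q183"}, {"Q90"}]
english = [{"Q90", "Q142"}, {"Q64", "Q183", "Q1"}]
alignment = greedy_align(german, english)
```

Raising `threshold` in this sketch trades recall for precision, which loosely mirrors the granularity and precision tradeoff described above: fewer, more confident pairs versus broader coverage of the article pair.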
