Tailoring and evaluating the Wikipedia for in-domain comparable corpora extraction

We propose an automatic, language-independent, graph-based method to build à-la-carte article collections on user-defined domains from Wikipedia. The core model explores the encyclopaedia's category graph and can produce both monolingual and multilingual comparable collections. We run thorough experiments to assess the quality of the resulting corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph-based model outperforms a retrieval-based approach and reaches an average precision of 84% on in-domain articles. Because manual evaluation is costly, we introduce the concept of "domainness" and design several automatic metrics to estimate the quality of the collections. Our best domainness metric correlates strongly with human-judged precision, making it a reasonable automatic alternative for assessing the quality of domain-specific corpora. We release the WikiTailor toolkit, which implements the extraction methods, the evaluation measures, and several utilities. WikiTailor makes it easy to obtain multilingual in-domain data from Wikipedia.
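
To make the idea of exploring the category graph concrete, the following is a minimal sketch, not the WikiTailor implementation. It assumes the category graph is available as two in-memory dictionaries (category to subcategories, category to articles) and bounds the exploration with a fixed depth limit; the function name, the toy data, and the depth-based stopping criterion are all hypothetical, since the abstract does not state how the exploration is bounded.

```python
from collections import deque


def collect_domain_articles(root, subcategories, articles, max_depth):
    """Breadth-first walk of a category graph starting at `root`.

    Gathers the articles attached to every category reached within
    `max_depth` levels of the root. The depth limit is an assumed,
    illustrative stopping criterion.
    """
    seen = {root}                 # guards against cycles in the graph
    queue = deque([(root, 0)])
    collected = set()
    while queue:
        category, depth = queue.popleft()
        collected.update(articles.get(category, ()))
        if depth == max_depth:
            continue              # do not expand beyond the depth limit
        for child in subcategories.get(category, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return collected


# Toy category graph (invented data, for illustration only).
SUBCATEGORIES = {
    "Astronomy": ["Planets", "Astronomers"],
    "Planets": ["Astronomy"],                  # cycle back to the root
    "Astronomers": ["People by occupation"],
    "People by occupation": ["Chefs"],
}
ARTICLES = {
    "Astronomy": ["Telescope"],
    "Planets": ["Mars", "Venus"],
    "Astronomers": ["Galileo Galilei"],
    "People by occupation": ["Occupation"],
    "Chefs": ["Auguste Escoffier"],
}

if __name__ == "__main__":
    for depth in range(4):
        found = collect_domain_articles("Astronomy", SUBCATEGORIES, ARTICLES, depth)
        print(depth, sorted(found))
```

On this toy graph, a shallow limit keeps the collection on topic, while a deeper one drifts into unrelated categories such as "Chefs". That drift is the kind of noise a precision or domainness measure is meant to expose.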
