Twitter As a Multilingual Source of Comparable Corpora

This article describes a new method for building comparable corpora from Twitter. Our strategy relies on the fact that Twitter is one of the most popular microblogging platforms, allowing large audiences to express their thoughts and reactions about specific events or breaking news in various languages. Given two languages and a particular topic, we propose collecting tweets in both languages whose content focuses on that topic and using them to construct a comparable corpus.
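As a rough illustration of the idea only, the selection step could be sketched as follows. This is not the authors' implementation: the tweet fields `lang` and `text`, the function name `build_comparable_corpus`, and the keyword-matching criterion are all assumptions made for the example, and the tweets are assumed to have been collected already.

```python
from collections import defaultdict


def build_comparable_corpus(tweets, lang_pair, topic_terms):
    """Keep tweets in the two chosen languages that mention the topic.

    tweets: iterable of dicts with hypothetical keys "lang" and "text".
    lang_pair: two language codes, e.g. ("en", "fr").
    topic_terms: dict mapping each language code to a set of
        lowercase topic keywords (an assumed, simplistic topic filter).
    """
    corpus = defaultdict(list)
    for tweet in tweets:
        lang = tweet["lang"]
        if lang not in lang_pair:
            continue  # ignore tweets outside the selected language pair
        # Naive whitespace tokenization; real tweets would need more care.
        tokens = set(tweet["text"].lower().split())
        if tokens & topic_terms[lang]:
            corpus[lang].append(tweet["text"])
    # One list of topic-focused tweets per language.
    return {lang: corpus[lang] for lang in lang_pair}
```

The two resulting lists are not translations of each other; they only share a topic, which is what makes the collection comparable rather than parallel.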
