Building comparable corpora from social networks

Comparable corpora are an attractive alternative to scarce parallel corpora in a range of natural language processing tasks. Many researchers have therefore stressed the need for large quantities of such corpora and for work on their construction. In this paper, we highlight the value of textual data mining in social networks. We propose extracting tweets from the microblogging service Twitter in order to construct a comparable corpus. This work aims to develop a new method for constructing comparable corpora from Twitter that could be used in future work to build a bilingual dictionary, using a text mining approach.
