Automatic building of multilingual resources from social networks: application to Maghrebi dialects (original French title: La construction automatique de ressources multilingues à partir des réseaux sociaux : application aux données dialectales du Maghreb)

Natural language processing relies on language resources such as text corpora, dictionaries, sentiment lexicons, morpho-syntactic analyzers, and taggers. For well-resourced languages, these resources are usually available. For under-resourced languages, by contrast, tools and data are often lacking. This thesis focuses on certain vernacular forms of Arabic used in the Maghreb. These forms, known as dialects, can be classified as under-resourced languages. Apart from raw text, generally extracted from social networks, very few resources exist for processing Arabic dialects. Compared with other under-resourced languages, these dialects have several specific features that make them harder to process. In particular, they have no standardized writing rules, so users write them without following any precise convention and the same word can have several spellings. Dialectal Arabic words can be written in Arabic script, in Latin script (so-called Arabizi), or in both. The Arabic dialects of the Maghreb are also strongly influenced by foreign languages such as French and English. Beyond the borrowing of words from these languages, automatic processing of dialects must account for another phenomenon, code-switching, which linguistics associates with diglossia, the coexistence of distinct language varieties within one community. Users can thus write in several languages within a single sentence: they may begin in Arabic dialect and switch mid-sentence to French, English, or Standard Arabic.
In addition, several dialects coexist within a single country, and a fortiori across the Arab world. Conventional NLP tools developed for Standard Arabic therefore cannot be applied directly to dialects. The main objective of this work is to propose methods for automatically building resources for Arabic dialects in general and Maghrebi dialects in particular. This is our contribution to the effort of the community working on the automatic processing of Arabic dialects. We have produced methods for building comparable corpora and lexical resources that record the different forms of an entry together with their polarity. We have also developed methods for processing Standard Arabic, both on Twitter data and on transcriptions produced by an automatic speech recognition system operating on Arabic-language videos from television channels such as Al Jazeera, France24, and Euronews. Finally, we compared the opinions expressed in automatic transcriptions of multilingual video sources covering the same topic, using a method based on the linguistic theory known as Appraisal.
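
To make the notion of a lexical resource "recording the different forms of an entry together with their polarity" more concrete, here is a minimal sketch in Python. It is not the method developed in the thesis: the names `LexiconEntry`, `script_of`, `build_index`, and `tag_sentence` are hypothetical, and the sample entry (the Maghrebi word for "good" with one Arabic-script and two Latin-script spellings) is an illustrative assumption, not data from the thesis. The sketch only shows how spelling variants in both scripts can point to one entry, and how the script of each token can be detected from Unicode character names.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


def script_of(token: str) -> str:
    """Classify a token as 'arabic', 'latin', 'mixed' or 'other'."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            # Digits are skipped: Arabizi uses them as letters (e.g. "7").
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            scripts.add("arabic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "other"
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed"


@dataclass
class LexiconEntry:
    """One lexicon entry: a reference form, its polarity, and its spellings."""
    lemma: str
    polarity: str  # assumed label set: 'positive', 'negative', 'neutral'
    variants: Set[str] = field(default_factory=set)


def build_index(entries: List[LexiconEntry]) -> Dict[str, LexiconEntry]:
    """Map every known spelling (lower-cased) to its entry."""
    index: Dict[str, LexiconEntry] = {}
    for entry in entries:
        for form in entry.variants | {entry.lemma}:
            index[form.lower()] = entry
    return index


def tag_sentence(
    sentence: str, index: Dict[str, LexiconEntry]
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (token, script, polarity or None) for each whitespace token."""
    tagged = []
    for token in sentence.split():
        entry = index.get(token.lower())
        tagged.append((token, script_of(token), entry.polarity if entry else None))
    return tagged


# Illustrative entry only (assumption, not taken from the thesis resources).
SAMPLE_ENTRIES = [
    LexiconEntry(lemma="mlih", polarity="positive", variants={"mli7", "مليح"}),
]
SAMPLE_INDEX = build_index(SAMPLE_ENTRIES)
```

Calling `tag_sentence("le film mli7", SAMPLE_INDEX)` tags each token with its script and, when the spelling is known, the polarity of its entry. A real resource would also need tokenization and spelling normalization that go well beyond whitespace splitting and exact lookup.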
