A hybrid possibilistic approach for Arabic full morphological disambiguation

Abstract Morphological ambiguity is an important phenomenon that affects several tasks in Arabic text analysis, indexing and mining, yet it has not been well studied in related work. In this paper, we investigate new approaches to disambiguating the morphological features of non-vocalized Arabic texts by combining statistical classification with linguistic rules. We perform unsupervised training on unlabelled vocalized Arabic corpora, so the training and testing sets contain imperfect instances (i.e. instances with ambiguous attributes and/or classes). To handle imperfect data, we compare two approaches: i) a possibilistic approach that handles imperfection directly; and ii) a data transformation-based approach that converts an imperfect dataset into a perfect one, so that classical classifiers can be applied. We also present an approach for dealing with unknown (Out-of-Vocabulary) words. The experiments focus mainly on classical texts, which have not been sufficiently studied in related work. We show that the possibilistic approach performs better than the transformation-based one. We also report encouraging results regarding i) the role of linguistic rules in improving disambiguation rates; and ii) the accuracy of our approach for full morphological disambiguation of unknown words.
