A Lexical Database for Modern Standard Arabic Interoperable with a Finite State Morphological Transducer

Current Arabic lexicons, whether computational or otherwise, make no distinction between entries from Modern Standard Arabic (MSA) and Classical Arabic (CA), and tend to include obsolete words that are not attested in current usage. We address this problem by building a large-scale, corpus-based lexical database that is representative of MSA. We use an MSA corpus of 1,089,111,204 words, a pre-annotation tool, machine learning techniques, and knowledge-based templatic matching to automatically acquire and filter lexical knowledge about morpho-syntactic attributes and inflection paradigms. The lexical database is scalable and interoperable, and it is suitable for constructing a morphological analyser regardless of the design approach and programming language used. It is formatted according to the Lexical Markup Framework (LMF), the ISO standard for lexical resource representation, and is used in developing an open-source finite-state morphological processing toolkit. We also build a web application, AraComLex (Arabic Computer Lexicon), for managing and curating the lexical database.
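
To make the LMF formatting concrete, the following is a minimal sketch of how a single lexical entry might be serialised in an LMF-style layout, where properties are expressed as `feat` elements carrying `att`/`val` pairs. The helper name `make_lmf_entry`, the feature names, and the sample values are assumptions chosen for illustration; they are not taken from the database described here.

```python
import xml.etree.ElementTree as ET


def make_lmf_entry(lemma, pos, features):
    """Build an LMF-style LexicalEntry element (illustrative layout only)."""
    entry = ET.Element("LexicalEntry")
    # Part of speech is stored as an att/val feature on the entry itself.
    ET.SubElement(entry, "feat", att="partOfSpeech", val=pos)
    # The lemma carries its written form as a nested feature.
    lemma_el = ET.SubElement(entry, "Lemma")
    ET.SubElement(lemma_el, "feat", att="writtenForm", val=lemma)
    # Remaining morpho-syntactic attributes (hypothetical names) follow,
    # sorted so the output is deterministic.
    for att, val in sorted(features.items()):
        ET.SubElement(entry, "feat", att=att, val=val)
    return entry


# Hypothetical entry for the noun "كتاب" (book).
entry = make_lmf_entry("كتاب", "noun", {"gender": "masculine", "number": "singular"})
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Because the entry is plain XML, any morphological analyser can consume it with a standard parser, which is the sense in which such a representation is independent of design approach and programming language.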
