An Automatically Built Named Entity Lexicon for Arabic

We adapt and extend the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN’s instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate corresponding articles in ten different languages in order to identify Named Entities (NEs). For Arabic articles that have no counterpart in any of the other languages, we apply keyword search on AWK abstracts. In addition, we perform a post-processing step to fetch further NEs from AWK that are not reachable through AWN. Finally, we investigate diacritization, that is, restoring the vowel marks of Arabic NEs, by matching against geonames databases, using the MADA-TOKAN tools and applying different heuristics. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built what is, to the best of our knowledge, the largest, most mature and best-structured Arabic NE lexical resource to date. The lexicon is stored and organised following the Lexical Markup Framework (LMF) ISO standard. We conduct a quantitative and qualitative evaluation of the lexicon against a manually annotated gold standard and achieve precision ranging from 95.83% (with 66.13% recall) to 99.31% (with 61.45% recall), depending on the value of a threshold.
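The two operating points reported above trade precision against recall. One way to compare them on a single scale is the balanced F-score (the harmonic mean of precision and recall). The abstract does not report an F-score, so the sketch below is a derived illustration rather than a result from the evaluation; the function name `f1_score` and the list `REPORTED` are hypothetical names introduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs as stated in the abstract for the two
# threshold settings. The threshold values themselves are not given there.
REPORTED = [(95.83, 66.13), (99.31, 61.45)]

for p, r in REPORTED:
    f1 = f1_score(p / 100, r / 100)
    print(f"P={p:.2f}%  R={r:.2f}%  F1={100 * f1:.2f}%")
```

Under this equal-weight measure, the lower-precision setting scores higher because its recall gain outweighs the precision loss. A precision-weighted F-beta would be a reasonable alternative if lexicon purity matters more than coverage.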
