Arabic Named Entity Recognition: An SVM-based approach

Named Entity Recognition (NER) has attracted significant attention because it has been shown to improve the performance of many Natural Language Processing (NLP) applications. More recently, there has been a surge in the development of NER systems for languages other than English. Given the relative abundance of resources for Arabic and the growing maturity of the state of the art in Arabic processing, interest in NER systems for the language is a natural development. In this paper, we investigate the impact of different sets of features, both language-independent and language-specific, in a discriminative machine learning framework, namely Support Vector Machines. We explore lexical, contextual and morphological features on nine data sets of different genres and annotations. We systematically measure the impact of the different features in isolation and in combination. The combination of all features achieves the highest performance, with our system yielding F1=82.71. In short, combining language-independent features with language-specific ones yields the best performance on all the genres of text we investigate.
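
To make the notion of lexical and contextual features concrete, the following is a minimal sketch of per-token feature extraction of the kind that could be vectorized and passed to an SVM classifier. It is an illustration under stated assumptions, not the feature set used in the paper: the function name `token_features`, the affix length limit, the context window size, the padding symbol, and the transliterated example sentence are all hypothetical. Morphological features are omitted because they would require an external analyzer.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence


def token_features(tokens: List[str], i: int,
                   window: int = 1, max_affix: int = 3) -> Dict[str, str]:
    """Build a feature dictionary for the token at position i.

    Lexical features: leading and trailing character n-grams of the token.
    Contextual features: the neighbouring tokens within +/- `window`.
    Both choices are assumptions made for illustration only.
    """
    word = tokens[i]
    feats = {"word": word}

    # Lexical features: prefixes and suffixes up to max_affix characters.
    for n in range(1, min(max_affix, len(word)) + 1):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]

    # Contextual features: surrounding tokens, padded at sentence edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"ctx{offset:+d}"] = tokens[j] if 0 <= j < len(tokens) else PAD

    return feats


# Hypothetical transliterated sentence, used only to exercise the function.
sentence = ["zAr", "mHmd", "AlqAhrp"]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

Each dictionary in `features` describes one token. In a full system these dictionaries would be converted to sparse vectors and paired with entity labels to train the classifier, and the feature groups could be switched on and off to measure their impact in isolation and in combination.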
