A hybrid approach to Arabic named entity recognition

In this paper, we propose a hybrid named entity recognition (NER) approach that combines the advantages of rule-based and machine learning (ML)-based approaches. The aim is to improve overall system performance and to overcome both the knowledge elicitation bottleneck and the lack of resources for under-resourced languages that require deep language processing, such as Arabic. The complexity of Arabic poses special challenges to researchers of Arabic NER, which is essential for both monolingual and multilingual applications. We used the hybrid approach to develop an Arabic NER system capable of recognizing 11 types of Arabic named entities: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. Extensive experiments were conducted using decision tree, Support Vector Machine and logistic regression classifiers to evaluate system performance. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML-based approaches when each is used independently. More importantly, our system outperforms the state of the art in Arabic NER when applied to the ANERcorp standard dataset, with F-measures of 0.94 for Person, 0.90 for Location and 0.88 for Organization.
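
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a balanced F-measure (F1) can be computed for NER output. It assumes exact-match scoring over entity spans, which is a common convention but is not confirmed by the abstract; the `Entity` tuple layout, the function name `f_measure`, and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start token, end token, entity type).
Entity = Tuple[int, int, str]


def f_measure(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match scoring (an assumption)."""
    # An entity counts as correct only if span and type both match a gold entity.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration only (not taken from ANERcorp).
gold = {(0, 1, "Person"), (3, 3, "Location"), (5, 6, "Organization"), (8, 8, "Person")}
predicted = {(0, 1, "Person"), (3, 3, "Location"), (5, 5, "Organization")}

precision, recall, f1 = f_measure(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the three predicted entities match the gold annotation exactly, so precision is 2/3, recall is 2/4, and F1 is 4/7. Per-type scores such as those quoted for Person, Location and Organization would be obtained by filtering both sets to a single entity type before calling the function.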
