NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic

Named Entity Recognition (NER) is an essential task for many natural language processing systems and draws on a variety of linguistic resources. NER becomes more complicated when the language is morphologically rich and structurally complex, as Arabic is; several characteristics of the language make it particularly challenging to handle. In previous work, we proposed an Arabic NER system that follows the hybrid approach, i.e. it integrates rule-based and machine learning-based NER. That hybrid system achieves state-of-the-art performance on standard Arabic NER evaluation datasets. In this article, we discuss a novel methodology for overcoming the coverage drawback of rule-based NER systems in order to improve their performance and allow for automated rule updates. The presented mechanism uses the recognition decisions made by the hybrid NER system to identify the weaknesses of the rule-based component and to derive new linguistic rules that enhance the rule base, yielding more reliable and accurate results. We used the ACE 2004 Newswire (NW) standard dataset as a resource for extracting and analyzing new linguistic rules for recognizing person, location and organization names. We formulate each new rule from two distinctive feature groups: gazetteers for each type of named entity, and part-of-speech (POS) tags, in particular noun and proper noun. Fourteen new patterns are derived, formulated as grammar rules, and evaluated in terms of coverage. The experiments use a POS-tagged version of the ACE 2004 NW dataset. The empirical results show that the enhanced rule-based system, NERA 2.0, improves coverage of the previously misclassified person, location and organization named entities by 69.93 per cent, 57.09 per cent and 54.28 per cent, respectively.
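
To make the idea of a rule built from a gazetteer plus POS tags more concrete, the following is a minimal Python sketch. It is not one of the fourteen NERA 2.0 patterns: the trigger-word gazetteer, the romanized tokens, the `NNP` tag name, and the functions `match_person_rule` and `coverage_gain` are all hypothetical and chosen only for illustration. The coverage helper assumes one plausible reading of the reported figures, namely the share of previously missed entities that the new rules recover.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag)

# Hypothetical gazetteer of person trigger words (romanized for readability).
PERSON_TRIGGERS = {"al-ra'is", "al-duktur", "al-sayyid"}


def match_person_rule(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, for runs of proper nouns
    that directly follow a trigger word from the gazetteer."""
    spans = []
    i = 0
    while i < len(tokens):
        word, _ = tokens[i]
        if word in PERSON_TRIGGERS:
            j = i + 1
            # Extend the candidate name over consecutive proper nouns.
            while j < len(tokens) and tokens[j][1] == "NNP":
                j += 1
            if j > i + 1:
                spans.append((i + 1, j))
            i = j
        else:
            i += 1
    return spans


def coverage_gain(missed_before: int, recovered: int) -> float:
    """Percentage of previously missed entities recovered by new rules
    (an assumed definition, not taken from the article)."""
    if missed_before == 0:
        return 0.0
    return 100.0 * recovered / missed_before


sentence = [
    ("qala", "VBD"), ("al-ra'is", "NN"), ("Mahmud", "NNP"),
    ("Abbas", "NNP"), ("fi", "IN"), ("Ramallah", "NNP"),
]
person_spans = match_person_rule(sentence)
```

In this sketch the trigger word itself is left outside the span, so only the proper-noun run is tagged as a person name; a real rule base could equally choose to include it.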
