Morphology-based Entity and Relational Entity Extraction Framework for Arabic

Rule-based techniques and tools for extracting entities and relational entities from documents allow users to specify desired entities using natural language questions, finite state automata, regular expressions, structured query language statements, or proprietary scripts. These techniques and tools require expertise in linguistics and programming, and they lack support for Arabic morphological analysis, which is key to processing Arabic text. In this work, we present MERF, a morphology-based entity and relational entity extraction framework for Arabic text. MERF provides a user-friendly interface where a user with basic knowledge of linguistic features and regular expressions defines tag types and interactively associates them with regular expressions defined over Boolean formulae. The Boolean formulae range over matches of Arabic morphological features and synonymity features. Users define relations as tuples of subexpression matches and can associate code actions with subexpressions. MERF computes feature matches and regular expression matches, and constructs entities and relational entities from the user-defined relations. We evaluated MERF on several case studies and compared it with existing application-specific techniques. The results show that MERF requires less development time and effort than existing techniques and produces reasonably accurate results with a reasonable run-time overhead.
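The pipeline described in the abstract (Boolean formulae over token features, regular expressions over formula matches, and relations built from subexpression matches) can be illustrated with a minimal Python sketch. This is not MERF's implementation or interface. The `Token` class, the feature names `pos` and `gloss`, the synonym set, the one-letter tag symbols, and the sample sentence analysis are all assumptions invented for illustration; a real system would obtain the features from an Arabic morphological analyzer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A token with hand-written (hypothetical) morphological features."""
    text: str   # transliterated surface form
    pos: str    # part-of-speech feature
    gloss: str  # English gloss, used here as a stand-in for synonymity


# Hypothetical synonym set standing in for a synonymity feature.
MOTION_SYNONYMS = {"go", "travel", "move"}

# Tag types: a one-letter symbol mapped to a Boolean formula over features.
TAG_TYPES = {
    "V": lambda t: t.pos == "VERB" and t.gloss in MOTION_SYNONYMS,
    "N": lambda t: t.pos == "NOUN_PROP",
    "P": lambda t: t.pos == "PREP" and not t.gloss == "from",
}


def tag(tokens):
    """Map each token to the symbol of the first formula it satisfies, else 'o'."""
    return "".join(
        next((sym for sym, formula in TAG_TYPES.items() if formula(t)), "o")
        for t in tokens
    )


# Regular expression over tag symbols; named groups act as subexpressions.
PATTERN = re.compile(r"V(?P<who>N+)P(?P<where>N+)")


def extract(tokens):
    """Return (who, where) relation tuples built from subexpression matches."""
    symbols = tag(tokens)  # one symbol per token, so string offsets are token offsets
    relations = []
    for m in PATTERN.finditer(symbols):
        who = " ".join(t.text for t in tokens[m.start("who"):m.end("who")])
        where = " ".join(t.text for t in tokens[m.start("where"):m.end("where")])
        relations.append((who, where))
    return relations


# Hand-written analysis of a transliterated sentence ("Ahmad went to Beirut").
SAMPLE = [
    Token("dhahaba", "VERB", "go"),
    Token("Ahmad", "NOUN_PROP", "Ahmad"),
    Token("ila", "PREP", "to"),
    Token("Bayrut", "NOUN_PROP", "Beirut"),
]

if __name__ == "__main__":
    print(tag(SAMPLE))
    print(extract(SAMPLE))
```

Because each token maps to exactly one symbol, match offsets in the symbol string index directly into the token list, which keeps the step from a regular expression match to a relation tuple short. A code action attached to a subexpression could be modeled in the same sketch as a callback invoked on each named group.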
