Arabic Morphological Analyzer with Agglutinative Affix Morphemes and Fusional Concatenation Rules

Current concatenative morphological analyzers treat a word as a prefix, a stem, and a suffix. They rely on lexicons of morphemes and on concatenation rules that determine whether a given prefix-stem, stem-suffix, or prefix-suffix combination is allowed. Existing affix lexicons are highly redundant, contain inconsistencies, and require significant manual work to extend with clitics and partial affixes. Unlike previous work, our method treats Arabic affixes as both fusional and agglutinative, i.e., as composed of one or more morphemes. It introduces new compatibility rules for affix-affix concatenations and refines the lexicons of the BAMA and SAMA analyzers to be smaller, less redundant, and more consistent. It also automatically and perfectly solves the correspondence problem between the segments of a word and their tags, e.g., part-of-speech and gloss tags.
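The idea of building affixes from atomic morphemes under compatibility rules can be illustrated with a minimal sketch. It is not the paper's implementation: the lexicon entries (in Buckwalter-style transliteration), the category names, the rule tables, and the functions `split_affix` and `analyze` are all hypothetical and chosen only for illustration. The sketch also assumes, as a simplification, that prefix-stem and stem-suffix compatibility is checked only on the morphemes adjacent to the stem.

```python
# Toy, hypothetical lexicons: surface form -> (category, POS tag, gloss).
PREFIXES = {
    "w": ("CONJ", "CONJ", "and"),
    "b": ("PREP", "PREP", "with/by"),
    "Al": ("DET", "DET", "the"),
}
STEMS = {"ktAb": ("NOUN", "NOUN", "book")}
SUFFIXES = {"h": ("POSS", "POSS_PRON_3MS", "his")}

# Hypothetical compatibility rules, expressed over categories.
PREFIX_PREFIX = {("CONJ", "PREP"), ("CONJ", "DET"), ("PREP", "DET")}
SUFFIX_SUFFIX = set()  # no compound suffixes in this toy example
PREFIX_STEM = {("CONJ", "NOUN"), ("PREP", "NOUN"), ("DET", "NOUN")}
STEM_SUFFIX = {("NOUN", "POSS")}
# Prefix-suffix pairs that must not co-occur (definite article + possessive).
PREFIX_SUFFIX_FORBIDDEN = {("DET", "POSS")}


def split_affix(text, lexicon, rules):
    """Yield each split of `text` into lexicon morphemes in which every
    adjacent pair of categories is allowed by `rules`."""
    if text == "":
        yield []
        return
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head not in lexicon:
            continue
        for rest in split_affix(text[i:], lexicon, rules):
            if rest and (lexicon[head][0], lexicon[rest[0]][0]) not in rules:
                continue
            yield [head] + rest


def analyze(word):
    """Return all analyses of `word`; each analysis is a list of
    (segment, POS tag, gloss) triples, one per morpheme."""
    analyses = []
    for p in range(len(word)):
        for s in range(p + 1, len(word) + 1):
            stem = word[p:s]
            if stem not in STEMS:
                continue
            stem_cat = STEMS[stem][0]
            for pre in split_affix(word[:p], PREFIXES, PREFIX_PREFIX):
                if pre and (PREFIXES[pre[-1]][0], stem_cat) not in PREFIX_STEM:
                    continue
                for suf in split_affix(word[s:], SUFFIXES, SUFFIX_SUFFIX):
                    if suf and (stem_cat, SUFFIXES[suf[0]][0]) not in STEM_SUFFIX:
                        continue
                    if any((PREFIXES[a][0], SUFFIXES[b][0]) in PREFIX_SUFFIX_FORBIDDEN
                           for a in pre for b in suf):
                        continue
                    # Each segment keeps its own tag and gloss, so the
                    # segment-to-tag correspondence is explicit.
                    analyses.append(
                        [(m, *PREFIXES[m][1:]) for m in pre]
                        + [(stem, *STEMS[stem][1:])]
                        + [(m, *SUFFIXES[m][1:]) for m in suf]
                    )
    return analyses


for w in ("wbAlktAb", "ktAbh", "AlktAbh", "bwktAb"):
    print(w, analyze(w))
```

In this sketch the compound prefix `wbAl` is never stored in the lexicon; it is derived from three atomic entries, which is where the reduction in redundancy would come from. Because every segment carries its own tag and gloss, the output lists each morpheme next to its tags rather than pairing one compound affix string with one concatenated tag string.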
