Paper Information - Towards an Optimal Solution to Lemmatization in Arabic

Towards an Optimal Solution to Lemmatization in Arabic

Abstract: Lemmatization (computing the canonical forms of words in running text) is an important component in any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to the absence of short vowels in writing. In this paper, we introduce a new lemmatizer tool that combines a machine-learning-based approach with a lemmatization dictionary, the latter providing increased accuracy, robustness, and flexibility to the former. Our evaluations yield a performance of over 98% for the entire lemmatization pipeline. The lemmatizer tools are freely downloadable for private and research purposes.
