Arabic named entity operational recognition system

Extracting named entities is an important step in information extraction from text based on a given ontology. Dealing with the Arabic language raises a number of additional challenges compared to English, French, and other languages in similar families. The major difficulties are a complex morphological system, the absence of capitalization, and the lack of standardized Arabic spelling. Arabic has a rich and complex morphology due to its highly inflected nature: a single lemma can surface in many forms that differ in internal structure, prefixes, and suffixes. Furthermore, Arabic writing is not standardized, so the same word is often spelled inconsistently. In this work, we propose an operational hybrid approach that combines dictionary-based and rule-based detection to extract seven categories of named entities: organization (by name), date, interval, price/value, percentage, currency, and unit. The dictionary-based component performs exact or approximate matching of words against a prepared list of Arabic organization names. When no exact match with a dictionary entry is found, approximate matching offers an efficient way to handle morphological variation. Specificities of the Arabic language are also handled by rule-based detection, which captures entity patterns as regular expressions or as patterns provided by experts. We evaluated our Arabic named entity recognition system on financial news articles and obtained a recognition rate of around 80%.
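As an illustration only, the hybrid idea described above might be sketched as follows in Python. Everything specific here is an assumption made for the sketch and not taken from the system itself: the two dictionary entries, the orthographic normalization, the two patterns, the 0.85 threshold, and the use of the character-similarity ratio from `difflib` as a stand-in for approximate matching.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-dictionary of organization names (illustrative only).
ORG_DICTIONARY = ["البنك المركزي", "شركة الاتصالات"]

# Hypothetical rules; a real system would use patterns provided by experts.
# In Python str patterns, \d also matches Arabic-Indic digits.
RULES = {
    "percentage": re.compile(r"\d+(?:[.,]\d+)?\s*(?:%|٪)"),
    "price": re.compile(r"\d+(?:[.,]\d+)?\s*(?:مليون\s+)?(?:دولار|يورو|ريال)"),
}

# Short-vowel marks (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize(text):
    """Reduce common spelling variation (an assumed, simplified scheme)."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0623\u0625\u0622]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return text


def find_organizations(text, dictionary, threshold=0.85):
    """Return (entry, matched text, score) for exact or approximate matches."""
    tokens = normalize(text).split()
    found = []
    for name in dictionary:
        target = normalize(name)
        n = len(target.split())
        # Compare every window of n tokens against the dictionary entry.
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            score = SequenceMatcher(None, window, target).ratio()
            if score >= threshold:
                found.append((name, window, round(score, 2)))
    return found


def find_by_rules(text):
    """Return (label, matched text) for every rule that fires."""
    normalized = normalize(text)
    return [(label, m.group(0))
            for label, pattern in RULES.items()
            for m in pattern.finditer(normalized)]


if __name__ == "__main__":
    sample = "أعلن البنك المركزى رفع الفائدة بنسبة 2.5% وضخ 30 مليون دولار"
    print(find_organizations(sample, ORG_DICTIONARY))
    print(find_by_rules(sample))
```

In this sketch, normalization absorbs spelling inconsistencies, so a variant spelling of a dictionary entry becomes an exact match, while the similarity threshold lets a prefixed form such as "للبنك المركزي" still match the entry "البنك المركزي". A fixed character-level threshold is a simplification; a production system would likely need morphology-aware matching to limit false positives.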