Second Generation AMIRA Tools for Arabic Processing: Fast and Robust Tokenization, POS Tagging, and Base Phrase Chunking

In this paper, we address the problem of processing Modern Standard Arabic. We present the second generation of tools that process Arabic (AMIRA). AMIRA is a successor suite to the ASVMTools. The AMIRA toolkit includes a clitic tokenizer (TOK), a part of speech tagger (POS), and a base phrase chunker (BPC), i.e., a shallow syntactic parser. The technology of AMIRA is based on supervised learning with no dependence on explicit modeling or knowledge of deep morphology. AMIRA uses a unified framework that casts each of the component problems as a classification task. The underlying technology employs Support Vector Machines in a sequence modeling framework using the YAMCHA toolkit. The system is very fast and robust, and allows for a number of variable user settings depending on the desired disambiguation granularity. The AMIRA toolkit has been widely used in different NLP applications (MT, IE, IR, NER, etc.) due to its speed and high performance.
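To illustrate what casting a component problem as a classification task in a sequence modeling framework can look like, the following is a minimal sketch and not the AMIRA implementation. It assumes a fixed window of surrounding tokens plus the previously predicted tags as features, and greedy left-to-right decoding. The names `window_features`, `greedy_tag`, and `toy_classifier`, the window sizes, and the tag labels are all hypothetical; `toy_classifier` is a hand-written stand-in for a trained SVM.

```python
from typing import Callable, Dict, List, Sequence

PAD = "<PAD>"  # placeholder for positions outside the sentence


def window_features(tokens: Sequence[str], prev_tags: Sequence[str], i: int,
                    window: int = 2, tag_history: int = 2) -> Dict[str, str]:
    """Build features for position i: tokens in a +/-window context
    and the tags already assigned to the preceding positions."""
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"tok[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else PAD
    for k in range(1, tag_history + 1):
        j = i - k
        feats[f"tag[-{k}]"] = prev_tags[j] if j >= 0 else PAD
    return feats


def greedy_tag(tokens: Sequence[str],
               classify: Callable[[Dict[str, str]], str]) -> List[str]:
    """Tag left to right; each decision is one classification whose
    features include the decisions made so far."""
    tags: List[str] = []
    for i in range(len(tokens)):
        tags.append(classify(window_features(tokens, tags, i)))
    return tags


def toy_classifier(feats: Dict[str, str]) -> str:
    """Hypothetical rule-based stand-in for a trained classifier.
    It chunks a sequence of coarse POS tags into BIO base phrases."""
    current = feats["tok[+0]"]
    previous_tag = feats["tag[-1]"]
    if current in {"DET", "ADJ", "NOUN"}:
        return "I-NP" if previous_tag in {"B-NP", "I-NP"} else "B-NP"
    if current == "VERB":
        return "B-VP"
    return "O"


if __name__ == "__main__":
    print(greedy_tag(["DET", "ADJ", "NOUN", "VERB"], toy_classifier))
```

In this sketch the same feature extractor and decoding loop would serve tokenization, POS tagging, and chunking alike; only the input units and the label set change, which is the sense in which the three problems share one framework.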