Romanized Arabic Transliteration

In the early 1990s, online communication was restricted to ASCII-only (English) environments. A convention evolved for typing Arabic in Roman characters; this script goes by various names, including Franco Arabic, Romanized Arabic, Arabizi, and Arabish. The convention was widely adopted, and today romanized Arabic (RAr) is everywhere: in instant messaging, forums, blog postings, product and movie ads, on mobile phones, and on TV. The problem is that most Arab users are more accustomed to the English keyboard layout, so romanized Arabic is easier to type, yet Arabic script is significantly easier to read. The obvious solution is automatic conversion of romanized Arabic to Arabic script, which would also increase the amount and quality of Arabic content authored online. The main challenges are that there is no standard convention for romanized Arabic (many-to-1 mappings) and that no parallel data are available. We present here a hybrid approach that we devised and implemented to build a romanized Arabic transliteration engine, which was later scaled to cover other scripts. Our approach leverages the work of Sherif and Kondrak (2007b) and Cherry and Suzuki (2009), and is heavily inspired by the basic phrase-based statistical machine translation approach devised by Och (2003).