A Two Stage Word Segmentation System for Handling Space Insertion Problem in Urdu Script

Hindi and Urdu are variants of the same language, but Hindi is written left to right in the Devanagari script, whereas Urdu is written right to left in a script derived from a Persian modification of the Arabic script. To break this script barrier, an Urdu-Devanagari transliteration system has been developed. The transliteration system faced many problems related to word segmentation of Urdu script, because spaces are often not placed properly between Urdu words. Sometimes a space is omitted, so that several Urdu words run together; at other times an extra space is inserted within a word, resulting in over-segmentation of that word. This paper presents a two-stage system for handling the extra space insertion problem in Urdu. In the first stage, Urdu grammar rules are applied; in the second stage, a statistical approach is employed. The statistical analysis uses lexical resources from both Urdu and Hindi, including Urdu and Hindi unigram and bigram probabilities. In addition, the Urdu-Devanagari transliteration module is executed in parallel to help in decision making. The system was tested on a 1.84 million word Urdu corpus, and the success rate was 98.57%. This is the first time such a system has been developed for Urdu script.
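
The two-stage idea can be illustrated with a minimal Python sketch. It is not the system described above: the function name `merge_extra_spaces`, the suffix set, the greedy left-to-right scan, and the decision rule (merge when the unigram probability of the joined form exceeds the bigram probability of the split form) are all assumptions made for illustration. The Hindi resources and the parallel transliteration module are omitted, and the romanized tokens and probabilities are toy placeholders, not corpus data.

```python
from typing import Dict, List, Set, Tuple


def merge_extra_spaces(
    tokens: List[str],
    bound_suffixes: Set[str],
    unigram: Dict[str, float],
    bigram: Dict[Tuple[str, str], float],
    floor: float = 1e-9,
) -> List[str]:
    """Greedily rejoin token pairs that look like one over-segmented word.

    Stage 1 (rule, hypothetical): a token listed in bound_suffixes never
    stands alone, so it is attached to the preceding token.
    Stage 2 (statistical, hypothetical): merge when the joined form is more
    probable as a single word than the pair is as two separate words.
    """
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            left, right = tokens[i], tokens[i + 1]
            joined = left + right

            # Stage 1: rule-based merge.
            if right in bound_suffixes:
                out.append(joined)
                i += 2
                continue

            # Stage 2: compare P(joined) with P(left, right).
            p_joined = unigram.get(joined, 0.0)
            p_split = bigram.get((left, right), 0.0)
            if p_split == 0.0:
                # Back off to a product of unigrams when the bigram is unseen.
                p_split = unigram.get(left, floor) * unigram.get(right, floor)
            if p_joined > p_split:
                out.append(joined)
                i += 2
                continue

        # No merge: keep the token as it is.
        out.append(tokens[i])
        i += 1
    return out


# Toy, made-up resources (romanized placeholders).
SUFFIXES = {"en"}
UNIGRAM = {"achi": 0.01, "hain": 0.02, "khoob": 0.001,
           "surat": 0.001, "khoobsurat": 0.005}
BIGRAM: Dict[Tuple[str, str], float] = {}

print(merge_extra_spaces(["kitab", "en", "achi", "hain"], SUFFIXES, UNIGRAM, BIGRAM))
print(merge_extra_spaces(["khoob", "surat", "hain"], SUFFIXES, UNIGRAM, BIGRAM))
```

In this sketch the rule stage runs first so that the statistical comparison is needed only for pairs the rules do not resolve, which mirrors the ordering of the two stages described in the abstract.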