Urdu Word Segmentation

Word segmentation is a prerequisite for almost all NLP applications, since their initial phase requires tokenizing the input into words. Urdu is among the Asian languages that face a word segmentation challenge. Unlike in other Asian languages, however, word segmentation in Urdu involves not only space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features of Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs n-gram ranking on top of a rule-based maximum matching heuristic. Our best technique achieves an error detection rate of 85.8% and an overall accuracy of 95.8%. Further issues and possible future directions are also discussed.
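The general idea of ranking dictionary-based segmentations with an n-gram model can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy Latin-script lexicon, the bigram counts, the "fewest words" reading of maximum matching, and the add-one smoothing are all assumptions made for illustration. It covers only the space omission case, not space insertion.

```python
import math


def candidate_segmentations(text, lexicon):
    """Enumerate every way to split `text` into words found in `lexicon`."""
    max_len = max(len(word) for word in lexicon)
    results = []

    def extend(start, words):
        if start == len(text):
            results.append(list(words))
            return
        # Try longer pieces first, in the spirit of maximum matching.
        for end in range(min(len(text), start + max_len), start, -1):
            piece = text[start:end]
            if piece in lexicon:
                words.append(piece)
                extend(end, words)
                words.pop()

    extend(0, [])
    return results


def bigram_log_score(words, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed bigram log-probability of a word sequence."""
    score = 0.0
    previous = "<s>"  # sentence-start marker
    for word in words:
        numerator = bigram_counts.get((previous, word), 0) + 1
        denominator = unigram_counts.get(previous, 0) + vocab_size
        score += math.log(numerator / denominator)
        previous = word
    return score


def segment(text, lexicon, bigram_counts, unigram_counts):
    """Keep the segmentations with the fewest words, then rank them by bigram score."""
    candidates = candidate_segmentations(text, lexicon)
    if not candidates:
        return [text]  # no dictionary cover: return the input unsegmented
    fewest = min(len(candidate) for candidate in candidates)
    shortlist = [c for c in candidates if len(c) == fewest]
    vocab_size = len(lexicon) + 1  # +1 for the sentence-start marker
    return max(
        shortlist,
        key=lambda c: bigram_log_score(c, bigram_counts, unigram_counts, vocab_size),
    )


# Hypothetical toy data (Latin script, not Urdu) with a genuine ambiguity.
LEXICON = {"no", "where", "now", "here"}
BIGRAM_COUNTS = {("<s>", "now"): 3, ("now", "here"): 5}
UNIGRAM_COUNTS = {"<s>": 10, "now": 6}

best = segment("nowhere", LEXICON, BIGRAM_COUNTS, UNIGRAM_COUNTS)
print(best)
```

Both "no where" and "now here" cover the input with two words, so the matching heuristic alone cannot choose between them. The bigram counts break the tie, which is the role the n-gram ranking plays in a hybrid of this kind.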
