Supervised Urdu Word Segmentation Model Based on POS Information

Urdu is the national language of Pakistan and one of the most widely spoken languages in the world. Successful Urdu NLP requires robust, high-performance tools and resources. Word segmentation plays a central role for morphologically rich languages such as Urdu across diverse NLP tasks, including named entity recognition, sentiment analysis, part-of-speech tagging, and information retrieval. Urdu's morphological richness makes word segmentation harder, because a single word can consist of zero or more prefixes, a stem, and zero or more suffixes. In this paper we present a supervised Urdu word segmentation scheme based on the part-of-speech (POS) information of the corresponding words. The experiments use conditional random fields (CRF) with contextual features. The proposed system is evaluated on 300K words, and the results show clear improvements over the baseline approach.
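To make the idea of contextual features with POS information concrete, the following is a minimal sketch of how such features might be built for a CRF-style sequence labeler. It is not the feature set used in the paper: the unit of labeling (space-delimited tokens), the B/I label scheme, the window size, the function names, and the romanized example tokens are all assumptions made for illustration.

```python
from typing import Dict, List


def token_features(tokens: List[str], pos_tags: List[str], i: int,
                   window: int = 1) -> Dict[str, object]:
    """Features for position i: the token, its POS tag, and its neighbours.

    The feature names and the +/-window context are hypothetical choices,
    not the configuration reported in the paper.
    """
    feats: Dict[str, object] = {"bias": 1.0, "tok": tokens[i], "pos": pos_tags[i]}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        if 0 <= j < len(tokens):
            feats[f"tok[{off:+d}]"] = tokens[j]
            feats[f"pos[{off:+d}]"] = pos_tags[j]
        else:
            # Mark positions that fall outside the sentence boundary.
            feats[f"pad[{off:+d}]"] = True
    return feats


def sequence_features(tokens: List[str], pos_tags: List[str],
                      window: int = 1) -> List[Dict[str, object]]:
    """One feature dict per token, for a whole sentence."""
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    return [token_features(tokens, pos_tags, i, window) for i in range(len(tokens))]


def labels_to_words(tokens: List[str], labels: List[str],
                    joiner: str = "") -> List[str]:
    """Merge tokens into words: 'B' starts a new word, 'I' continues it."""
    words: List[str] = []
    for tok, lab in zip(tokens, labels):
        if lab == "B" or not words:
            words.append(tok)
        else:
            words[-1] += joiner + tok
    return words


# Romanized placeholder sentence (assumed example, not from the paper's data).
tokens = ["khoob", "soorat", "larki"]
pos_tags = ["ADJ", "ADJ", "NN"]
X = sequence_features(tokens, pos_tags)
words = labels_to_words(tokens, ["B", "I", "B"])
```

A list of per-token feature dictionaries like `X` is the input shape that common CRF toolkits accept for one sentence, so the sketch could be paired with such a toolkit for training. The `labels_to_words` helper shows how predicted B/I labels would be turned back into segmented words under the assumed scheme.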
