NE Tagging for Urdu based on Bootstrap POS Learning

Part of Speech (POS) tagging and Named Entity (NE) tagging have become important components of effective text analysis. In this paper, we propose a bootstrapped model that involves four levels of text processing for Urdu. We show that increasing the training data for POS learning by applying bootstrapping techniques improves NE tagging results. Our model overcomes the limitation imposed by the scarcity of ground-truth data required for training a learning model. Both our POS tagging and NE tagging models are based on the Conditional Random Field (CRF) learning approach. To further enhance performance, grammar rules and lexicon lookups are applied to the final output to correct any spurious tag assignments. We also propose a model for word boundary segmentation in which a bigram HMM is trained on character transitions among all positions in each word. The generated words are further processed using a probabilistic language model. All models use a hybrid approach that combines statistical models with hand-crafted grammar rules.
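The bootstrapping idea described above, growing the POS training set from unlabeled text, can be illustrated with a minimal self-training sketch. This is not the paper's implementation: the abstract does not state how automatically tagged sentences are selected, so the confidence threshold, the round limit, and the `FrequencyTagger` class (a toy word-and-suffix lookup standing in for the CRF tagger) are all assumptions made for illustration, as are the romanized example tokens.

```python
from collections import Counter, defaultdict


class FrequencyTagger:
    """Toy stand-in for a CRF tagger (assumption): word lookup with suffix backoff."""

    def __init__(self, suffix_len=2):
        self.suffix_len = suffix_len
        self.word_tags = defaultdict(Counter)
        self.suffix_tags = defaultdict(Counter)

    def fit(self, tagged_sentences):
        # Rebuild both count tables from scratch on every call.
        self.word_tags.clear()
        self.suffix_tags.clear()
        for sentence in tagged_sentences:
            for word, tag in sentence:
                self.word_tags[word][tag] += 1
                self.suffix_tags[word[-self.suffix_len:]][tag] += 1
        return self

    def predict_token(self, word):
        """Return (tag, confidence); confidence is the tag's relative frequency."""
        lookups = (
            (self.word_tags, word),
            (self.suffix_tags, word[-self.suffix_len:]),
        )
        for table, key in lookups:
            counts = table.get(key)
            if counts:
                tag, n = counts.most_common(1)[0]
                return tag, n / sum(counts.values())
        return "UNK", 0.0

    def predict(self, words):
        return [self.predict_token(w) for w in words]


def bootstrap(seed, unlabeled, threshold=0.9, max_rounds=5):
    """Self-training loop: add confidently tagged sentences to the training set.

    `threshold` and `max_rounds` are hypothetical parameters, not from the paper.
    """
    labeled = list(seed)
    pool = list(unlabeled)
    tagger = FrequencyTagger().fit(labeled)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for words in pool:
            preds = tagger.predict(words)
            # A sentence is only as trustworthy as its least certain token.
            confidence = min((c for _, c in preds), default=0.0)
            if confidence >= threshold:
                labeled.append([(w, t) for w, (t, _) in zip(words, preds)])
                added += 1
            else:
                remaining.append(words)
        if added == 0:
            break  # nothing new was learned; stop early
        pool = remaining
        tagger.fit(labeled)  # retrain on the enlarged training set
    return tagger, labeled


if __name__ == "__main__":
    # Illustrative romanized tokens and tags, not data from the paper.
    seed = [
        [("larka", "NN"), ("gaya", "VB")],
        [("larki", "NN"), ("gayi", "VB")],
    ]
    unlabeled = [["larka", "aaya"], ["xyz"]]
    tagger, labeled = bootstrap(seed, unlabeled)
    print(len(labeled), tagger.predict_token("aaya"))
```

In a fuller pipeline, the enlarged tagged set would train the real POS model, whose tags would then serve as features for the NE tagger. Taking the minimum token confidence per sentence is one conservative selection rule among several possible ones.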
