A semi-supervised associative classification method for POS tagging

We present a data mining approach to part-of-speech (POS) tagging, an important classification task in natural language processing (NLP). Specifically, we propose a semi-supervised associative classification method for POS tagging. Existing methods for building POS taggers require extensive domain and linguistic knowledge and resources. Our method trains on a combination of a small POS-tagged corpus and untagged text, and builds the classifier model from association rules. The tagger works well even with very little training data, and the use of semi-supervised learning removes the need for a large, high-quality tagged corpus. These properties make it especially suitable for resource-poor languages. Our experiments on resource-rich, resource-moderate, and resource-poor languages show good performance without using any language-specific linguistic information. We note that including such information as features in our method may further improve performance. The results also show that, for smaller training data sizes, our tagger performs better than a state-of-the-art CRF tagger that uses the same features.
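To make the idea of building a tagger from association rules concrete, here is a minimal sketch in Python. It is not the method of this paper: the features (current word and previous tag), the thresholds `min_support` and `min_confidence`, the function names `mine_rules` and `tag_sentence`, and the toy corpus are all assumptions chosen for illustration. It covers only rule mining from the tagged part of the data; how untagged text enters training is not described above and is left out.

```python
from collections import Counter, defaultdict


def mine_rules(tagged_sentences, min_support=2, min_confidence=0.6):
    """Mine rules of the form (feature, value) -> tag from tagged sentences.

    Hypothetical feature set: the lowercased word and the previous tag.
    """
    feature_counts = Counter()
    feature_tag_counts = Counter()
    for sentence in tagged_sentences:
        for i, (word, tag) in enumerate(sentence):
            prev_tag = sentence[i - 1][1] if i > 0 else "<s>"
            for feature in (("word", word.lower()), ("prev_tag", prev_tag)):
                feature_counts[feature] += 1
                feature_tag_counts[(feature, tag)] += 1

    rules = defaultdict(list)
    for (feature, tag), support in feature_tag_counts.items():
        confidence = support / feature_counts[feature]
        # Keep only rules that are both frequent and reliable enough.
        if support >= min_support and confidence >= min_confidence:
            rules[feature].append((confidence, support, tag))
    return rules


def tag_sentence(words, rules, default_tag="NOUN"):
    """Tag left to right, applying the matching rule with the highest
    confidence (ties broken by support)."""
    tags = []
    prev_tag = "<s>"
    for word in words:
        candidates = (rules.get(("word", word.lower()), [])
                      + rules.get(("prev_tag", prev_tag), []))
        tag = max(candidates)[2] if candidates else default_tag
        tags.append(tag)
        prev_tag = tag
    return tags


# Toy tagged corpus (invented for this example).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
]
rules = mine_rules(corpus)
print(tag_sentence(["the", "bird", "sings"], rules))
# → ['DET', 'NOUN', 'VERB']
```

In this toy run, "bird" and "sings" never appear in the tagged data, so they are tagged by the previous-tag rules alone. This shows how contextual rules can cover unseen words without any language-specific information.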
