A comparative study on feature reduction approaches in Hindi and Bengali named entity recognition

Features used for named entity recognition (NER) are often high dimensional, which causes overfitting when training data are insufficient. Dimensionality reduction can improve performance in such situations. A number of dimensionality reduction approaches exist, based on either feature selection or feature extraction. In this paper we perform a comprehensive, comparative study of different dimensionality reduction approaches applied to the NER task. To compare the performance of the various approaches we consider two Indian languages, namely Hindi and Bengali, for which NER accuracies are still comparatively poor, primarily due to the scarcity of annotated corpora. For both languages, dimensionality reduction is found to improve classifier performance. A detailed comparative study of the effectiveness of several dimensionality reduction techniques is presented in this paper.
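
To make the feature-selection setting concrete, the following is a minimal sketch, not the paper's actual pipeline, of how sparse one-hot token features for NER can be reduced to their most informative dimensions before training a MaxEnt-style classifier. It assumes scikit-learn; the feature names, toy tokens, and labels are purely hypothetical.

```python
# Minimal sketch (assumed setup, not the paper's method): feature selection
# for a token-level NER classifier using scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical per-token features: word identity, suffix, previous word.
X_dicts = [
    {"w": "dilli", "suf2": "li", "prev": "<s>"},
    {"w": "mein",  "suf2": "in", "prev": "dilli"},
    {"w": "ram",   "suf2": "am", "prev": "mein"},
]
y = ["B-LOC", "O", "B-PER"]  # toy BIO-style labels

pipeline = Pipeline([
    ("vec", DictVectorizer()),                          # one-hot encode sparse features
    ("select", SelectKBest(mutual_info_classif, k=2)),  # keep the k most informative dimensions
    ("clf", LogisticRegression(max_iter=1000)),         # MaxEnt-style classifier
])
pipeline.fit(X_dicts, y)
print(pipeline.predict([{"w": "dilli", "suf2": "li", "prev": "mein"}]))
```

In practice the number of retained dimensions k would be tuned on held-out data, and the scoring function (mutual information, information gain, chi-square, etc.) is one of the choices such a comparative study would vary.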
