Text Classification of Kannada Webpages Using Various Pre-processing Agents

Text classification of webpages has wide applications, and many techniques have been employed for it. This paper attempts to classify Kannada webpages into six predetermined categories. Kannada is a morphologically rich Indian language. The webpages are subjected to different pre-processing steps, and machine learning techniques such as Naive Bayes and Maximum Entropy are applied to train models. Each pre-processing step before classification is implemented as an intelligent agent that performs a particular task, such as language identification, sentence boundary detection, or term frequency calculation. The highest accuracy, 0.9, is achieved when both stemming and stopword removal are applied.
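The pipeline described above (stopword removal, stemming, term frequency calculation, then a Naive Bayes classifier) can be illustrated with a minimal sketch. This is not the paper's implementation: the stopword list, the suffix-stripping stemmer, the English toy documents, and the two category names are all hypothetical placeholders, and the pre-processing agents are modelled as plain functions. Maximum Entropy training is not shown.

```python
import math
from collections import Counter, defaultdict

# Hypothetical placeholders: a real system would use Kannada resources.
STOPWORDS = {"the", "a", "of", "in", "and"}
SUFFIXES = ("ing", "es", "s")


def stem(token):
    """Crude suffix stripping; stands in for a real Kannada stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, split on whitespace, drop stopwords, then stem."""
    return [stem(t) for t in text.lower().split() if t not in STOPWORDS]


def term_frequencies(text):
    """Count how often each pre-processed term occurs in the text."""
    return Counter(preprocess(text))


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents per class
        self.class_terms = defaultdict(Counter)  # term counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tf = term_frequencies(text)
            self.class_docs[label] += 1
            self.class_terms[label].update(tf)
            self.vocab.update(tf)
        return self

    def predict(self, text):
        tf = term_frequencies(text)
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label in self.class_docs:
            # Log prior plus smoothed log likelihood of each observed term.
            score = math.log(self.class_docs[label] / total_docs)
            denom = sum(self.class_terms[label].values()) + len(self.vocab)
            for term, count in tf.items():
                numer = self.class_terms[label][term] + 1
                score += count * math.log(numer / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical, English for readability).
train_texts = [
    "the team wins the matches",
    "players scoring goals in the match",
    "the market and stocks rising",
    "banks of the city cut rates",
]
train_labels = ["sports", "sports", "business", "business"]
model = NaiveBayes().fit(train_texts, train_labels)
```

Calling `model.predict(text)` on a new document returns the category with the highest smoothed log posterior. Replacing `preprocess` with a variant that skips stemming or stopword removal would give a simple way to compare pre-processing configurations, in the spirit of the comparison reported above.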
