Paper Information - Text Classification of Kannada Webpages Using Various Pre-processing Agents

Text Classification of Kannada Webpages Using Various Pre-processing Agents

Text classification of webpages has wide applications, and many techniques have been employed for it. This paper attempts to classify Kannada webpages into six predetermined classes. Kannada is a morphologically rich Indian language. Kannada webpages are subjected to different pre-processing steps, and machine learning techniques such as Naive Bayes and Maximum Entropy are applied to train models. Each pre-processing step before classification is implemented as an intelligent agent that performs a particular task, such as language identification, sentence boundary detection, or term frequency calculation. The highest accuracy, 0.9, is achieved when both stemming and stopword removal are applied.
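As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch of stopword removal, term frequency calculation, and a multinomial Naive Bayes classifier. It is not the authors' implementation: the names `STOPWORDS`, `term_frequencies`, and `NaiveBayes` are hypothetical, the toy data is English rather than Kannada, only two classes are used instead of six, and stemming and the agent architecture are omitted.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stopword list; a real system would need a Kannada list.
STOPWORDS = {"the", "a", "of", "and", "is"}


def term_frequencies(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, drop stopwords, and count terms."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return Counter(tokens)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.term_counts = defaultdict(Counter)  # term counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tf = term_frequencies(doc)
            self.term_counts[label].update(tf)
            self.vocab.update(tf)
        self.total_docs = len(labels)
        return self

    def predict(self, doc):
        tf = term_frequencies(doc)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known term.
            score = math.log(n_docs / self.total_docs)
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for term, count in tf.items():
                if term not in self.vocab:
                    continue  # ignore terms never seen in training
                numer = self.term_counts[label][term] + 1
                score += count * math.log(numer / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented purely for illustration.
train_docs = [
    "the match ended with a late goal",
    "the team won the cricket match",
    "the bank raised the interest rate",
    "shares of the bank fell",
]
train_labels = ["sports", "sports", "business", "business"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Calling `model.predict("cricket match goal")` on this toy data selects the class whose smoothed term counts best explain the input. A stemmer, if available for the target language, would be applied inside `term_frequencies` before counting.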

P. Ramakanth Kumar | N. Deepamala
