Bangla news classification using naive Bayes classifier

The Web is gigantic and constantly updated. Bangla news on the Web has grown rapidly in the information age, and each news site has its own layout and its own categorization for grouping news. This heterogeneity of layout and categorization cannot always satisfy an individual user's needs. Removing this heterogeneity and classifying news articles according to user preference is a formidable task. In this paper, we propose an approach that lets a user find news articles related to a specific category. We use a web crawler that we developed ourselves to extract useful text from the HTML pages of news articles and construct a Full-Text-RSS. The content of each news article is tokenized with a modified lightweight Bangla stemmer. To achieve better classification results, we remove the less significant words, i.e., stop words, from each document. We apply the naive Bayes classifier to classify Bangla news article content based on the IPTC news codes. Our experimental results show the effectiveness of our classification system.
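The classification step described above can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model with add-one (Laplace) smoothing, which is a common choice but is not confirmed as the exact variant used in the paper. The stop-word list, the English toy corpus, and the labels `sport` and `politics` are hypothetical stand-ins for Bangla text and IPTC news codes, and stemming is omitted.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word list; the paper's actual Bangla list is not given.
STOP_WORDS = {"the", "a", "in", "of"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log posterior (up to a constant)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior: fraction of training documents in this class.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing keeps unseen class/word pairs above zero.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data with illustrative labels (not actual IPTC codes).
TRAINING = [
    ("the team won the cricket match", "sport"),
    ("a late goal decided the football match", "sport"),
    ("the parliament passed the budget", "politics"),
    ("the minister spoke in parliament of the election", "politics"),
]

model = train(TRAINING)
print(classify("cricket team in the final match", model))
print(classify("election budget", model))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied, and ignoring words that never appeared in training keeps them from affecting every class equally.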
