Hadoop MapReduce implementation of a novel scheme for term weighting in text categorization

Text categorization is the problem of assigning text documents to a fixed number of predefined categories. Feature selection and term weighting are two important steps that determine the result of any text categorization task. In this paper we focus on two things: first, developing effective term weighting by proposing a new term weighting scheme, and second, using the parallel and distributed processing capability of Hadoop MapReduce for training and testing on the dataset. Together, these lead to a substantial performance improvement in text categorization: accuracy improves markedly while computational cost is significantly reduced. The use of Hadoop MapReduce also significantly reduces training and testing time.
