An Optimal Weighting Method in Supervised Learning of Linguistic Model for Text Classification

This paper discusses a new weighting method for text analysis from the viewpoint of supervised learning. The term frequency and inverse document frequency measure (tf-idf measure) is a well-known weighting method for information retrieval, and it can also be used for text analysis. However, it is an empirical weighting method for information retrieval whose effectiveness has not been clarified from a theoretical viewpoint. Therefore, a more effective weighting measure may be obtainable for document classification problems. In this study, we propose an optimal weighting method for document classification problems from the viewpoint of supervised learning. The proposed measure, which makes use of training data, is more suitable for text classification than the tf-idf measure. The effectiveness of our proposal is demonstrated by simulation experiments on text classification of newspaper articles and of customer reviews posted on a web site.
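For readers unfamiliar with the baseline, the tf-idf measure mentioned above can be sketched as follows. This is a minimal illustration only: it assumes raw term counts for tf and log(N / df) for idf, which is one common variant, and the abstract does not state which variant the authors compare against. The function name `tf_idf` and the toy documents are hypothetical, and the sketch does not implement the proposed supervised weighting.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Assumed variant (not taken from the
    paper): weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus (hypothetical): three short tokenized customer reviews.
docs = [
    ["price", "good", "good"],
    ["price", "bad"],
    ["price", "good", "delivery"],
]
weights = tf_idf(docs)
```

In this toy corpus "price" occurs in every document, so its weight is zero everywhere, while a term confined to one document, such as "bad", receives the largest idf. Note that the weights depend only on term statistics and not on class labels, which is the gap a supervised weighting method aims to close.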
