Self-Switching Classification Framework for Titled Documents

Ambiguous words are words that have more than one meaning, such as "apple" or "window". In text classification they are usually removed by feature reduction methods such as information gain. Sometimes, however, a corpus contains so many ambiguous words that we cannot simply discard them, especially when classifying documents from the Web. In this paper we look for a method to classify titled documents with the help of ambiguous words. Titled documents have a simple structure consisting of a title and an excerpt; news items, messages, and paper abstracts with titles are examples. Instead of introducing another feature reduction method, we describe a framework that makes the most of the ambiguous words in titled documents. The framework improves the performance of a traditional bag-of-words classifier with the help of a bag-of-word-pairs classifier. As a case in point, we implement the framework using one of the most popular classifiers, multinomial naive Bayes (MNB). Experiments with three real-life datasets show that, within our framework, the MNB model performs much better than both the traditional MNB classifier and the naive weighting algorithm, which simply puts more weight on the title words.
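To make the naive weighting baseline mentioned above concrete, the following is a minimal sketch of a multinomial naive Bayes classifier in which each title word is counted more heavily than an excerpt word. It illustrates only the baseline, not the self-switching framework itself. The names `weighted_counts` and `TinyMNB`, the `title_weight` value, whitespace tokenization, Laplace smoothing, and the toy documents are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def weighted_counts(title, excerpt, title_weight=2):
    """Bag-of-words counts; each title token adds title_weight (assumed value)."""
    counts = Counter(excerpt.lower().split())
    for token in title.lower().split():
        counts[token] += title_weight
    return counts


class TinyMNB:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # per-class word counts
        self.vocab = set()
        for counts, label in zip(docs, labels):
            self.word_counts[label].update(counts)
            self.vocab.update(counts)
        self.total = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        self.n_docs = len(labels)
        return self

    def predict(self, counts):
        best, best_score = None, -math.inf
        vocab_size = len(self.vocab)
        for c in self.class_docs:
            # log prior + sum over words of count * log smoothed likelihood
            score = math.log(self.class_docs[c] / self.n_docs)
            for word, n in counts.items():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                p = (self.word_counts[c][word] + 1) / (self.total[c] + vocab_size)
                score += n * math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best


# Toy titled documents (invented): "apple" is ambiguous across the two classes.
train = [
    ("Apple unveils new phone", "the company released a device with a faster chip", "tech"),
    ("Orchard harvest begins", "farmers pick apple and pear crops this autumn", "food"),
    ("Chip maker reports earnings", "the company sold more processors this quarter", "tech"),
    ("Apple pie recipe", "bake the fruit with sugar and cinnamon", "food"),
]
model = TinyMNB().fit(
    [weighted_counts(title, excerpt) for title, excerpt, _ in train],
    [label for _, _, label in train],
)
prediction = model.predict(weighted_counts("New phone chip", "the company device is faster"))
```

Setting `title_weight=0` ignores the title entirely, and larger values push the classifier toward the title vocabulary. The abstract reports that this kind of fixed weighting is outperformed by the proposed framework, so the sketch serves only as a reference point for comparison.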
