Improved Nearest Neighbor Methods For Text Classification With Language Modeling and Harmonic Functions

We present new nearest neighbor methods for text classification and evaluate them against existing nearest neighbor methods and other well-known text classification algorithms. Inspired by the language modeling approach to information retrieval, we improve k-nearest neighbor (kNN) classification by replacing the classical cosine similarity with a similarity measure based on KL divergence. We also extend kNN to the semi-supervised case; the resulting formulation turns out to be equivalent to semi-supervised learning with harmonic functions. In both supervised and semi-supervised experiments, our algorithms surpass state-of-the-art methods such as Support Vector Machines (SVM) and transductive SVM on the Reuters Corpus Volume I (RCV1) and the 20 Newsgroups dataset, and produce competitive results on the Reuters-21578 dataset. To our knowledge, this paper presents the most comprehensive evaluation of different machine learning algorithms on the entire RCV1 dataset.
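As a rough illustration of the supervised idea, the sketch below ranks training documents by the KL divergence between smoothed unigram language models instead of by cosine similarity, then takes a majority vote over the k closest. It is not the paper's implementation: the Jelinek-Mercer smoothing, the weight `lam`, the unweighted vote, the toy data, and the function names `unigram_model`, `kl_divergence`, and `knn_classify` are all assumptions made for this example.

```python
import math
from collections import Counter


def unigram_model(tokens, collection, lam=0.5):
    """Smoothed unigram model over the shared vocabulary (assumed
    Jelinek-Mercer smoothing): (1 - lam) * p_doc(w) + lam * p_collection(w)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        w: (1 - lam) * counts[w] / total + lam * p_bg
        for w, p_bg in collection.items()
    }


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary.
    Smoothing with lam > 0 keeps every q[w] strictly positive."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def knn_classify(query_tokens, train, k=3, lam=0.5):
    """Label a query by majority vote among the k training documents
    whose language models have the lowest KL divergence from the query's."""
    # Collection (background) model built from all training text plus the
    # query, so every word in play has nonzero background probability.
    all_tokens = [t for toks, _ in train for t in toks] + list(query_tokens)
    n = len(all_tokens)
    collection = {w: c / n for w, c in Counter(all_tokens).items()}

    q_model = unigram_model(query_tokens, collection, lam)
    scored = []
    for toks, label in train:
        d_model = unigram_model(toks, collection, lam)
        scored.append((kl_divergence(q_model, d_model), label))

    # Lower divergence means a closer neighbor.
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
train = [
    ("stocks market shares rise".split(), "finance"),
    ("bank market profit shares".split(), "finance"),
    ("team wins match goal".split(), "sports"),
    ("player scores goal match".split(), "sports"),
]
prediction = knn_classify("market shares profit".split(), train, k=3)
```

Unlike cosine similarity, KL divergence is asymmetric, so the direction chosen here (query model against document model) is itself a design assumption; the reverse direction or a symmetrized variant would slot into `knn_classify` the same way.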
