Heterogeneous learner for Web page classification

Classification of a target class of Web pages is an interesting problem. Typical machine learning algorithms for this problem require two classes of training data: positive and negative examples. In Web page classification, however, gathering an unbiased sample of negative examples is difficult. We propose a heterogeneous learning framework for classifying Web pages that (1) eliminates the need for negative training data and (2) increases classification accuracy. The framework combines two complementary learners, a decision list and a linear separator: together they remove the need for negative training data in the training phase and increase accuracy in the testing phase. Our results show that the framework achieves high accuracy without negative training data, and that it improves on the accuracy of linear separators by reducing errors on "low-margin data". In short, it classifies more accurately while requiring less human effort in training.
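
The abstract does not say how the two learners are combined, so the following is only a minimal sketch of one plausible reading of the testing phase: a linear separator decides when its score is far from the boundary, and a decision list takes over for low-margin inputs. Every name here (`linear_score`, `decision_list_predict`, `classify`), the `margin` threshold, and the rule format are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# A page is represented as a sparse feature map (hypothetical representation).
Features = Dict[str, float]


def linear_score(weights: Features, bias: float, x: Features) -> float:
    """Signed score of a linear separator: bias + w . x."""
    return bias + sum(weights.get(name, 0.0) * value for name, value in x.items())


def decision_list_predict(rules: List[Tuple[str, bool]], default: bool, x: Features) -> bool:
    """Return the label of the first rule whose feature is present in x."""
    for feature, label in rules:
        if x.get(feature, 0.0) > 0:
            return label
    return default


def classify(
    x: Features,
    weights: Features,
    bias: float,
    rules: List[Tuple[str, bool]],
    margin: float = 0.5,   # assumed threshold, not from the paper
    default: bool = False,
) -> bool:
    """Trust the linear separator on high-margin inputs; otherwise
    defer to the decision list (an assumed combination scheme)."""
    score = linear_score(weights, bias, x)
    if abs(score) >= margin:
        return score > 0
    return decision_list_predict(rules, default, x)
```

In this sketch the decision list is consulted only when the separator's score falls inside the margin band, which is one way to target errors on low-margin data without changing high-confidence predictions.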
