PEBL: positive example based learning for Web page classification using SVM

Web page classification is one of the essential techniques for Web mining. Specifically, identifying the Web pages that belong to a class of interest to the user is the first step in mining interesting information from the Web. However, constructing a classifier for such a class requires laborious preprocessing, namely collecting positive and negative training examples. For instance, to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of non-homepages (negative examples). Collecting negative training examples is especially arduous and requires special care to avoid biasing the sample. In this paper we introduce the Positive Example Based Learning (PEBL) framework for Web page classification, which eliminates the need to collect negative training examples manually during preprocessing. We present an algorithm called Mapping-Convergence (M-C) that, using only positive and unlabeled data, achieves classification accuracy as high as that of a traditional SVM trained on positive and negative data. Our experiments show that when the M-C algorithm is given the same number of positive examples as a traditional SVM, it performs as well as the traditional SVM.
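The abstract names the M-C algorithm without detailing its steps. As a rough illustration of learning from positive and unlabeled data only, the sketch below first sets aside unlabeled points that lie far from the positives as presumed negatives, then repeatedly retrains a classifier and moves newly rejected unlabeled points into the negative set until nothing changes. Everything here is an assumption made for illustration: the function names, the `far_threshold` parameter, the toy 2-D data, and above all the nearest-centroid classifier, which stands in for the SVM used in the paper. It should not be read as the authors' implementation.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def pu_learn(positives, unlabeled, far_threshold):
    """Split `unlabeled` into (predicted positives, predicted negatives).

    Hypothetical sketch: a nearest-centroid rule replaces the SVM,
    and `far_threshold` is an illustrative parameter, not from the paper.
    """
    pos_center = centroid(positives)

    # Initial stage (assumed heuristic): unlabeled points far from the
    # positive centroid are treated as presumed negatives.
    negatives, remaining = [], []
    for u in unlabeled:
        if dist(u, pos_center) > far_threshold:
            negatives.append(u)
        else:
            remaining.append(u)

    # Iterative stage: retrain on positives vs. accumulated negatives,
    # and move any remaining point classified as negative into the
    # negative set. Stop when an iteration moves nothing.
    while negatives and remaining:
        neg_center = centroid(negatives)
        moved, kept = [], []
        for u in remaining:
            if dist(u, neg_center) < dist(u, pos_center):
                moved.append(u)
            else:
                kept.append(u)
        if not moved:
            break
        negatives.extend(moved)
        remaining = kept

    return remaining, negatives


# Toy data (made up): positives cluster near the origin.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
unlabeled = [(0.5, 0.6), (6.0, 6.0), (3.0, 3.0),
             (0.2, 0.9), (7.0, 5.0), (4.0, 4.0)]

predicted_pos, predicted_neg = pu_learn(positives, unlabeled, far_threshold=5.5)
print(predicted_pos)
print(predicted_neg)
```

On this toy data the loop needs more than one pass: the point `(4.0, 4.0)` is rejected only after the first retraining, and `(3.0, 3.0)` only after the negative centroid has shifted toward it. No manually labeled negatives are supplied at any stage, which is the property the PEBL framework is built around.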