PEBL: Web page classification without negative examples

Web page classification is one of the essential techniques for Web mining because classifying Web pages of an interesting class is often the first step of mining the Web. However, constructing a classifier for an interesting class requires laborious preprocessing such as collecting positive and negative training examples. For instance, in order to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of nonhomepages (negative examples). In particular, collecting negative training examples requires arduous work and caution to avoid bias. The paper presents a framework, called positive example based learning (PEBL), for Web page classification which eliminates the need for manually collecting negative training examples in preprocessing. The PEBL framework applies an algorithm, called mapping-convergence (M-C), to achieve high classification accuracy (with positive and unlabeled data) as high as that of a traditional SVM (with positive and negative data). M-C runs in two stages: the mapping stage and convergence stage. In the mapping stage, the algorithm uses a weak classifier that draws an initial approximation of "strong" negative data. Based on the initial approximation, the convergence stage iteratively runs an internal classifier (e.g., SVM) which maximizes margins to progressively improve the approximation of negative data. Thus, the class boundary eventually converges to the true boundary of the positive class in the feature space. We present the M-C algorithm with supporting theoretical and experimental justifications. Our experiments show that, given the same set of positive examples; the M-C algorithm outperforms one-class SVMs, and it is almost as accurate as the traditional SVMs.

[1]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[2]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[3]  Bernhard Schölkopf,et al.  Support Vector Method for Novelty Detection , 1999, NIPS.

[4]  Stanley M. Bileschi,et al.  Advances in component based face detection , 2003, 2003 IEEE International SOI Conference. Proceedings (Cat. No.03CH37443).

[5]  Kevin Chen-Chuan Chang,et al.  PEBL: positive example based learning for Web page classification using SVM , 2002, KDD.

[6]  William P. Birmingham,et al.  Improving category specific Web search by learning query modifications , 2001, Proceedings 2001 Symposium on Applications and the Internet.

[7]  Neel Sundaresan,et al.  A classifier for semi-structured documents , 2000, KDD '00.

[8]  Bernhard Schölkopf,et al.  SV Estimation of a Distribution's Support , 1999, NIPS 1999.

[9]  Rémi Gilleron,et al.  Learning from positive and unlabeled examples , 2000, Theor. Comput. Sci..

[10]  Rémi Gilleron,et al.  Positive and Unlabeled Examples Help Learning , 1999, ALT.

[11]  Eddy Mayoraz Multiclass Classification with Pairwise Coupled Neural Networks or Support Vector Machines , 2001, ICANN.

[12]  Ada Wai-Chee Fu,et al.  Finding Structure and Characteristics of Web Documents for Classification , 2000, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

[13]  Sung-Hyon Myaeng,et al.  A practical hypertext catergorization method using links and incrementally available class information , 2000, SIGIR '00.

[14]  Robert P. W. Duin,et al.  Uniform Object Generation for Optimizing One-class Classifiers , 2002, J. Mach. Learn. Res..

[15]  Frann Cois Denis,et al.  PAC Learning from Positive Statistical Queries , 1998, ALT.

[16]  Giovanni Soda,et al.  Autoassociator-based models for speaker verification , 1996, Pattern Recognit. Lett..

[17]  Tom M. Mitchell,et al.  Learning to Extract Symbolic Knowledge from the World Wide Web , 1998, AAAI/IAAI.

[18]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[19]  Yoram Singer,et al.  Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers , 2000, J. Mach. Learn. Res..

[20]  C. Lee Giles,et al.  DEADLINER: building a new niche search engine , 2000, CIKM '00.

[21]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[22]  Robert P. W. Duin,et al.  Support vector domain description , 1999, Pattern Recognit. Lett..

[23]  H. Mase Experiments on Automatic Web Page Categorization for IR system , 1998 .

[24]  Malik Yousef,et al.  One-Class SVMs for Document Classification , 2002, J. Mach. Learn. Res..

[25]  Marco Gori,et al.  A neural network-based model for paper currency recognition and verification , 1996, IEEE Trans. Neural Networks.

[26]  Hsinchun Chen,et al.  Internet Categorization and Search: A Self-Organizing Approach , 1996, J. Vis. Commun. Image Represent..

[27]  Susan T. Dumais,et al.  Hierarchical classification of Web content , 2000, SIGIR '00.

[28]  Philip S. Yu,et al.  Partially Supervised Classification of Text Documents , 2002, ICML.

[29]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[30]  Thorsten Joachims,et al.  Text categorization with support vector machines , 1999 .

[31]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .