Customizable Instance-Driven Webpage Filtering Based on Semi-Supervised Learning

The World Wide Web has grown rapidly in recent years, and with it the need for content-based Webpage filtering. Most existing filtering systems, however, cannot easily satisfy the personalized filtering demands of different users at the same time. This paper proposes a customizable, instance-driven Webpage filtering strategy. For each user, the system produces a separate Webpage filter by mining the particular Webpage classes that user cares about. A semi-supervised learning (SSL) approach is applied to obtain a precise description of the Webpage class a user wants to filter, starting from the small instance set that the user provides and enlarging it into a larger training set. A feature selection step is then performed, and a Bayes classifier is built over the enlarged training set. Experimental results show that the proposed method is stable, performs well, and outperforms existing methods.
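
The three stages named in the abstract (enlarging a small instance set with SSL, selecting features, training a Bayes classifier) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the SSL algorithm or the feature-selection criterion, so self-training and a document-frequency-difference score are swapped in here as simple stand-ins. The labels "block" and "allow", the toy pages, the 0.9 confidence threshold, and all function names are hypothetical, and the sketch assumes the user supplies a few examples of pages to keep as well as pages to filter.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_nb(docs, labels, vocab):
    """Multinomial naive Bayes with Laplace smoothing over a fixed vocabulary."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(t for t in tokenize(doc) if t in vocab)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        loglik[c] = {t: math.log((counts[c][t] + 1) / total) for t in vocab}
    return prior, loglik

def predict_proba(model, doc):
    """Posterior class probabilities; out-of-vocabulary terms are ignored."""
    prior, loglik = model
    scores = {c: prior[c] + sum(loglik[c][t] for t in tokenize(doc) if t in loglik[c])
              for c in prior}
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return {c: math.exp(s - top) / z for c, s in scores.items()}

def classify(model, doc):
    proba = predict_proba(model, doc)
    return max(proba, key=proba.get)

def self_train(seed_docs, seed_labels, unlabeled, threshold=0.9, rounds=5):
    """Stand-in SSL step: move confidently classified pages into the training set."""
    docs, labels, pool = list(seed_docs), list(seed_labels), list(unlabeled)
    for _ in range(rounds):
        vocab = {t for d in docs for t in tokenize(d)}
        model = train_nb(docs, labels, vocab)
        remaining, added = [], []
        for d in pool:
            proba = predict_proba(model, d)
            best = max(proba, key=proba.get)
            if proba[best] >= threshold:
                added.append((d, best))
            else:
                remaining.append(d)
        if not added:
            break
        for d, y in added:
            docs.append(d)
            labels.append(y)
        pool = remaining
    return docs, labels

def select_features(docs, labels, k):
    """Stand-in criterion: keep the k terms whose document frequency differs most between classes."""
    n = Counter(labels)
    df = {c: Counter() for c in n}
    for doc, y in zip(docs, labels):
        df[y].update(set(tokenize(doc)))
    terms = set().union(*df.values())
    def score(t):
        rates = [df[c][t] / n[c] for c in df]
        return max(rates) - min(rates)
    return set(sorted(terms, key=lambda t: (-score(t), t))[:k])

# Hypothetical toy data: a tiny user-provided instance set plus unlabeled pages.
seed_docs = ["casino poker bet jackpot", "poker bet odds casino",
             "python code tutorial function", "code function library tutorial"]
seed_labels = ["block", "block", "allow", "allow"]
unlabeled = ["jackpot casino slots bet", "library python function module",
             "slots jackpot casino roulette"]

docs, labels = self_train(seed_docs, seed_labels, unlabeled)   # 1. enlarge the training set
vocab = select_features(docs, labels, k=6)                     # 2. feature selection
model = train_nb(docs, labels, vocab)                          # 3. Bayes classifier
print(classify(model, "online casino jackpot bonus"))
print(classify(model, "python library code"))
```

In this toy run the third unlabeled page is not confident enough to be added in the first round and is only absorbed in the second, after the first round has introduced the term "slots" into the training set. That is the kind of gradual enlargement the SSL step is meant to provide.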
