A method for managing access to web pages: Filtering by Statistical Classification (FSC) applied to text

Various entities (e.g., parents, employers) that provide users (e.g., children, employees) with access to web content wish to limit the content those users can reach. Available filtering methods are crude: they too often block "acceptable" content while failing to block "unacceptable" content. This paper presents a general and flexible classification method, based on statistical techniques applied to text, that we call Filtering by Statistical Classification (FSC). From each entity's expressed opinions about which content in a training data set is or is not acceptable, FSC constructs a customized model that represents that entity's preferences. FSC then uses this model to examine new web content and to block unwanted content. The empirical results suggest that our method has greater predictive power than a variety of existing approaches.
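
The train-then-filter workflow described above can be illustrated with a small sketch. The abstract does not say which statistical model FSC uses, so the example below substitutes a multinomial naive Bayes classifier with add-one smoothing purely as a stand-in; the class name `EntityFilter`, its methods, and the sample texts are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

LABELS = ("acceptable", "unacceptable")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class EntityFilter:
    """Hypothetical per-entity filter: one instance per parent or employer.

    Assumption: naive Bayes stands in for FSC's unspecified statistical model.
    """

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Record one training page that the entity labeled."""
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label, vocab_size):
        """Log prior plus smoothed log likelihood of the tokens under a label."""
        total_words = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokens:
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def should_block(self, text):
        """Return True when 'unacceptable' is the more probable label."""
        if any(self.doc_counts[label] == 0 for label in LABELS):
            raise ValueError("need at least one training page per label")
        vocab = set()
        for label in LABELS:
            vocab.update(self.word_counts[label])
        tokens = tokenize(text)
        bad = self._log_score(tokens, "unacceptable", len(vocab))
        good = self._log_score(tokens, "acceptable", len(vocab))
        return bad > good


# Illustrative training data for one entity (invented for this sketch).
parent = EntityFilter()
parent.train("casino poker betting jackpot", "unacceptable")
parent.train("online gambling odds betting", "unacceptable")
parent.train("homework help math fractions", "acceptable")
parent.train("science project volcano homework", "acceptable")

print(parent.should_block("poker betting odds"))
print(parent.should_block("math homework help"))
```

Because each entity trains its own `EntityFilter` instance, two entities that label the same pages differently end up with different blocking decisions, which is the customization the abstract describes.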
