ChangeDetector™: a site-level monitoring tool for the WWW

This paper presents a new challenge for Web monitoring tools: to build a system that can monitor entire web sites effectively. Such a system could potentially be used to discover "silent news" hidden within corporate web sites. Examples of silent news include reorganizations in the executive team of a company or in the retirement of a product line. ChangeDetector, an implemented prototype, addresses this challenge by incorporating a number of machine learning techniques. The principal backend components of ChangeDetector all rely on machine learning: intelligent crawling, page classification and entity-based change detection. Intelligent crawling enables ChangeDetector to selectively crawl the most relevant pages of very large sites. Classification allows change detection to be filtered by topic. Entity extraction over changed pages permits change detection to be filtered by semantic concepts, such as person names, dates, addresses, and phone numbers. Finally, the front end presents a flexible way for subscribers to interact with the database of detected changes to pinpoint those changes most likely to be of interest.

[1]  Mark S. Ackerman,et al.  The Do-I-Care Agent: Effective Social Discovery and Filtering on the Web , 1997, RIAO.

[2]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[3]  Andrew McCallum,et al.  Using Maximum Entropy for Text Classification , 1999 .

[4]  Jean-Luc Meunier,et al.  Collaborative document monitoring , 2001, GROUP '01.

[5]  Andrew McCallum,et al.  Using Reinforcement Learning to Spider the Web Efficiently , 1999, ICML.

[6]  Ion Muslea,et al.  Extraction Patterns for Information Extraction Tasks: A Survey , 1999 .

[7]  Robert E. Schapire,et al.  A Brief Introduction to Boosting , 1999, IJCAI.

[8]  Fred Douglis,et al.  TopBlend: An Efficient Implementation of HtmlDiff in Java , 2000, WebNet.

[9]  Martin van den Berg,et al.  Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery , 1999, Comput. Networks.

[10]  Yiming Yang,et al.  Hypertext Categorization using Hyperlink Patterns and Meta Data , 2001, ICML.

[11]  Yih-Farn Robin Chen,et al.  Website News: A Website Tracking and Visualization Service , 1998, WebNet.

[12]  Douglas E. Appelt,et al.  FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text , 1997, ArXiv.

[13]  Fred Douglis,et al.  The AT&T Internet Difference Engine: Tracking and viewing changes on the web , 1998, World Wide Web.

[14]  J. Shewchuk An Introduction to the Conjugate Gradient Method Without the Agonizing Pain , 1994 .

[15]  Thomas P. Minka,et al.  Algorithms for maximum-likelihood logistic regression , 2003 .

[16]  Daniel S. Hirschberg,et al.  Algorithms for the Longest Common Subsequence Problem , 1977, JACM.

[17]  Olivier Dedieu Pluxy : un proxy Web dynamiquement extensible , 1998 .