Modeling and managing content changes in text databases

Large amounts of (often valuable) information are stored in Web-accessible text databases. "Metasearchers" provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not need to change over time. However, databases are rarely static, and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this paper, we first report the results of a study showing how the content summaries of 152 real Web databases evolved over a period of 52 weeks. Then, we show how to use "survival analysis" techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real Web databases.
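
To make the scheduling idea concrete, the following is a minimal sketch of how a proportional hazards model could drive an update decision. It is not the model fitted in the paper: the constant baseline hazard, the coefficient values, the covariates, and the names `survival` and `weeks_until_update` are all assumptions chosen for illustration. The only part taken from the general technique is the proportional hazards form, in which the probability that a summary is still valid after t weeks is S(t | x) = exp(-H0(t) * exp(beta . x)).

```python
import math


def survival(t_weeks, baseline_rate, beta, covariates):
    """Probability that a content summary is still valid after t_weeks.

    Proportional hazards form: S(t | x) = exp(-H0(t) * exp(beta . x)).
    ASSUMPTION: the baseline cumulative hazard is H0(t) = baseline_rate * t
    (a constant baseline hazard), chosen only to keep the sketch short.
    """
    risk = math.exp(sum(b * x for b, x in zip(beta, covariates)))
    return math.exp(-baseline_rate * t_weeks * risk)


def weeks_until_update(baseline_rate, beta, covariates,
                       threshold=0.5, horizon=52):
    """First week at which the survival probability drops below threshold.

    ASSUMPTION: the threshold rule and the 52-week horizon are illustrative;
    if the probability never drops below threshold, the horizon is returned.
    """
    for t in range(1, horizon + 1):
        if survival(t, baseline_rate, beta, covariates) < threshold:
            return t
    return horizon


if __name__ == "__main__":
    # Hypothetical database whose single covariate doubles the hazard.
    print(weeks_until_update(0.1, [math.log(2)], [1.0]))
```

Under these assumptions, a database whose covariates raise the hazard is scheduled for an earlier update than one at the baseline, which is the behavior a change-aware schedule needs.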
