Using neighbors to date web documents

Time has been used successfully as a feature in web information retrieval tasks. In this context, estimating a document's inception date or last update date is a necessary step. Classic approaches estimate a document's last update time from HTTP header fields. The main problem with this approach is that it applies to only a small part of web documents. In this work, we evaluate an alternative strategy based on a document's neighborhood. Using a random sample of 10,000 URLs from the Yahoo! Directory, we study each document's links and media assets to determine its age. Considering documents in isolation, we are able to date 52% of them. When the document's neighborhood is included, we are able to estimate a date for more than 86% of the same sample. We also find that estimates differ significantly according to the type of neighbor used: the most reliable estimates are based on the document's media assets, while the worst are based on incoming links. These results are experimentally evaluated in a real-world application using different datasets.
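
As a rough illustration of the neighborhood idea, the following minimal sketch dates a document from its own `Last-Modified` header and, when that is missing or unparseable, falls back to the headers of its neighbors. It is not the method evaluated in this work: the function names, the `media` / `out_links` / `in_links` keys, the use of the median as the aggregate, and the position of outgoing links in the priority order are all assumptions made for the example. Only the ordering of media assets (most reliable) ahead of incoming links (least reliable) comes from the findings above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from statistics import median
from typing import Dict, List, Optional, Tuple

# Media assets first and incoming links last, following the abstract.
# Placing outgoing links in the middle is an assumption of this sketch.
NEIGHBOR_PRIORITY = ("media", "out_links", "in_links")


def parse_last_modified(value: Optional[str]) -> Optional[datetime]:
    """Parse an HTTP Last-Modified value; return None if absent or invalid."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # Treat naive timestamps as UTC so all dates are comparable.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def estimate_date(
    own_header: Optional[str],
    neighbor_headers: Dict[str, List[Optional[str]]],
) -> Tuple[Optional[datetime], str]:
    """Return (estimated date, source), where source names what was used."""
    own = parse_last_modified(own_header)
    if own is not None:
        return own, "self"
    for kind in NEIGHBOR_PRIORITY:
        parsed = (parse_last_modified(h) for h in neighbor_headers.get(kind, []))
        dates = [d for d in parsed if d is not None]
        if dates:
            # Median of neighbor timestamps (an assumed aggregation choice).
            middle = median(d.timestamp() for d in dates)
            return datetime.fromtimestamp(middle, tz=timezone.utc), kind
    return None, "none"
```

A caller would collect the header values while crawling and pass them in, for example `estimate_date(None, {"media": [...], "in_links": [...]})`. Returning the source alongside the date lets downstream code weight an estimate by the kind of neighbor that produced it.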
