Aiding web crawlers; projecting web page last modification
Because of the colossal amount of data on the Web, Web archivists typically rely on Web crawlers for automated collection. The Internet Archive is the largest organization that uses a crawling approach to maintain an archive of the entire Web. The most important requirement for a Web crawler, especially one used for Web archiving, is to know the date (and time) of last modification of a Web page. Knowing this date has several advantages, the most important being (i) presenting an up-to-date version of a Web page to the end user and (ii) making it easier to adjust the crawl rate, which allows future retrieval of a Web page's version at a given date or computation of its refresh rate. The typical way to obtain this information, the Last-Modified: HTTP header, unfortunately does not always provide correct information. In this work, we discuss, with the help of experiments, various techniques for determining the date of last modification of a Web page. These techniques help in adjusting the crawl rate for a specific page and in presenting users with up-to-date information, and thus make future versioning of a Web page more precise.
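As a rough illustration of why the Last-Modified: header cannot be taken at face value, the Python sketch below parses the header and rejects values that look unreliable. The function names, the 5-second tolerance, and the rule itself (distrust a timestamp that is missing, malformed, or not earlier than the response's Date: header) are assumptions made for this example; they are not the techniques evaluated in the paper.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def parse_http_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an HTTP date header value; return None if missing or malformed."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Assumption: treat a zone-less timestamp as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def trusted_last_modified(
    headers: Mapping[str, str],
    tolerance: timedelta = timedelta(seconds=5),  # hypothetical threshold
) -> Optional[datetime]:
    """Return the Last-Modified time only if it looks plausible.

    Assumed heuristic: a value at (or after) the time the response was
    served usually means the page was generated on the fly, so it says
    nothing about when the content actually changed.
    """
    last_modified = parse_http_date(headers.get("Last-Modified"))
    served_at = parse_http_date(headers.get("Date"))
    if last_modified is None:
        return None
    if served_at is not None and last_modified >= served_at - tolerance:
        return None
    return last_modified
```

The sketch expects `headers` to be a plain mapping with exactly these key spellings; a real HTTP client would normally provide a case-insensitive header object. When the function returns `None`, a crawler would have to fall back on other evidence, such as the content-based methods the paper discusses.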