Web Archive System for Efficient Storage of Web History Information

The growth of the web has made large amounts of information convenient to access, and most people now depend on the web to obtain information. However, data on the web is routinely updated and deleted by web server managers, so much earlier information disappears regardless of its importance. For this reason, web archive systems have been studied as a way to efficiently manage valuable data produced over a long period of time. Existing web archive systems, however, do not support systematic processing and management of data as it existed before an update, and their storage systems are inefficient when storing large quantities of web information. In this paper, the proposed method uses a special crawler to collect web history information. The WebBase crawler reduces the overhead of web page collection, and deleted web information is stored using an RCS. As a result, web history information can be stored and accessed efficiently.
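The abstract does not describe how the RCS is applied to crawled pages. As a rough illustration only, the sketch below assumes an RCS-style layout in which the newest snapshot of a page is kept in full and each older version is kept as a reverse delta against its successor. The names `make_delta`, `apply_delta`, and `PageHistory` are hypothetical and do not come from the paper or from WebBase.

```python
from difflib import SequenceMatcher


def make_delta(src, dst):
    """Return an edit script that turns line list `src` into `dst`.

    Only the changed regions are recorded, as (start, end, replacement_lines).
    """
    ops = SequenceMatcher(a=src, b=dst, autojunk=False).get_opcodes()
    return [(i1, i2, dst[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_delta(src, delta):
    """Apply an edit script produced by make_delta to `src`."""
    out, pos = [], 0
    for start, end, new_lines in delta:
        out.extend(src[pos:start])   # copy the unchanged lines before the edit
        out.extend(new_lines)        # emit the replacement lines
        pos = end                    # skip the lines that were replaced
    out.extend(src[pos:])
    return out


class PageHistory:
    """Hypothetical per-URL store: newest version in full, older ones as reverse deltas."""

    def __init__(self):
        self.head = None            # lines of the newest crawled version
        self.reverse_deltas = []    # reverse_deltas[k] turns version k+1 into version k

    def commit(self, text):
        """Record a crawled snapshot; return True if a new version was stored."""
        lines = text.splitlines()
        if self.head is None:
            self.head = lines
            return True
        if lines == self.head:
            return False            # page unchanged since the last crawl
        self.reverse_deltas.append(make_delta(lines, self.head))
        self.head = lines
        return True

    def version(self, n):
        """Rebuild version n (0 is the oldest) by walking back from the head."""
        lines = self.head
        for delta in reversed(self.reverse_deltas[n:]):
            lines = apply_delta(lines, delta)
        return "\n".join(lines)


history = PageHistory()
history.commit("title\nold body\nfooter")
history.commit("title\nnew body\nfooter\ncontact")
history.commit("new title\nfooter\ncontact")
```

With this layout the current page is read directly, while an earlier version costs one delta application per step back, which suits an archive where recent content is read most often. Line endings are normalized by `splitlines`, so a trailing newline is not preserved in this sketch.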
