Automatic Detection of Shared Fragments in Large Collections of Web Pages and its Applications

Various approaches have been proposed to reduce network-related delays in serving dynamic web pages. A fundamental problem common to several of them is how to automatically find shared fragments in large numbers of web pages. The same problem arises in studies of web content characteristics at fragment granularity. This paper gives a formal definition of the problem, presents an efficient and scalable algorithm for solving it, and describes applications of the algorithm. In the problem definition, we introduce the notion of a compound fragment, and our definition of a maximal shared fragment captures the real characteristics of fragments that are suitable for being delivered and cached individually. Our algorithm has two distinguishing features: (1) it finds true maximal shared fragments, and (2) it handles large collections of web pages effectively by using database techniques. The algorithm has been implemented and applied to 16 large sets of web pages. The experiments show that the algorithm can effectively handle large numbers of web pages and can provide significant bandwidth savings and latency reduction when used in fragment-based web caching.
