Locality sensitive hashing for scalable structural classification and clustering of web documents

Web content management systems and web front ends to databases usually generate and populate HTML documents from homogeneous templates, filling them with structured, semi-structured or plain text data. Wrapper-based information extraction techniques rely on such templates as a cornerstone of their functionality, but they depend heavily on the availability of suitable training documents based on the specific template. Structural classification and structural clustering of web documents are therefore important contributing factors to the success of those methods. We introduce a novel technique to support these two tasks: template fingerprints. Template fingerprints are locality sensitive hash values in the form of short character sequences that effectively represent the underlying template of a web document. Small changes in the document structure, as they may occur in template-based documents, lead to no or only minor variations in the corresponding fingerprint. Based on the fingerprints, we introduce a scalable index structure and algorithm that efficiently retrieve structurally similar documents from large collections of web documents. The effectiveness of our approach is empirically validated in a classification task on a data set of 13,237 documents based on 50 templates from different domains. Its general efficiency and scalability are evaluated in a clustering task on a data set retrieved from the Open Directory Project comprising more than 3.6 million web documents. For both tasks, our template fingerprint approach provides results of high quality and demonstrates a runtime linear in the number of documents, O(n).
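The abstract does not spell out how a template fingerprint is computed, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a SimHash-style scheme over shingles of HTML tag names and a simple block-based bucket index; the names `tag_shingles`, `fingerprint` and `FingerprintIndex`, the 32-bit width and the shingle length are all hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects opening and closing tag names; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_shingles(html, k=3):
    """Return the set of k-grams over the document's tag sequence."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    return {" ".join(tags[i:i + k]) for i in range(max(1, len(tags) - k + 1))}


def fingerprint(html, bits=32, k=3):
    """SimHash over tag shingles, rendered as a short hex string (assumed scheme)."""
    counts = [0] * bits
    for shingle in tag_shingles(html, k):
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    value = sum(1 << b for b in range(bits) if counts[b] > 0)
    return format(value, "0{}x".format(bits // 4))


class FingerprintIndex:
    """Buckets documents by fixed-position blocks of their fingerprint."""

    def __init__(self, blocks=4):
        self.blocks = blocks
        self.buckets = defaultdict(set)

    def _keys(self, fp):
        size = len(fp) // self.blocks
        return [(i, fp[i * size:(i + 1) * size]) for i in range(self.blocks)]

    def add(self, doc_id, html):
        # Constant work per document for a fixed fingerprint length.
        for key in self._keys(fingerprint(html)):
            self.buckets[key].add(doc_id)

    def candidates(self, html):
        """Documents sharing at least one fingerprint block with the query."""
        found = set()
        for key in self._keys(fingerprint(html)):
            found |= self.buckets[key]
        return found


page_a = "<html><body><div><h1>Title A</h1><p>Text one</p></div></body></html>"
page_b = "<html><body><div><h1>Other</h1><p>Different words</p></div></body></html>"

index = FingerprintIndex()
index.add("page_a", page_a)
print(fingerprint(page_a), fingerprint(page_b), index.candidates(page_b))
```

Because only the tag sequence is hashed, two pages that differ in text but share the same markup receive the same fingerprint in this sketch, and a small structural change alters only a few shingles and therefore tends to flip only a few bits. Inserting a document touches a fixed number of buckets, so building the index over n documents takes time linear in n under these assumptions.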
