Clustering Template Based Web Documents

More and more documents on the World Wide Web are based on templates. On a technical level, this gives those documents very similar source code and DOM tree structure. Grouping documents that are based on the same template is an important task for applications that analyse template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. Combining these distance measures with different clustering approaches, we show which combination of methods leads to the desired result.
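
To make the idea of a structural distance concrete, the following is a minimal sketch and not one of the measures evaluated in the paper, which the abstract does not specify. It assumes a simple set-of-tag-paths representation of the DOM tree, compares two documents with a Jaccard distance, and groups documents by single linkage under a fixed threshold. The names `tag_paths`, `path_distance`, `cluster`, and the threshold value 0.5 are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: close everything up to the matching tag.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_distance(html_a, html_b):
    """Jaccard distance between the tag-path sets of two documents."""
    paths_a, paths_b = tag_paths(html_a), tag_paths(html_b)
    union = paths_a | paths_b
    if not union:
        return 0.0
    return 1.0 - len(paths_a & paths_b) / len(union)


def cluster(docs, threshold=0.5):
    """Single-linkage grouping: link documents whose distance <= threshold."""
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if path_distance(docs[i], docs[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    docs = [
        "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
        "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>",
        "<html><body><table><tr><td>q</td></tr></table></body></html>",
    ]
    print(cluster(docs))
```

In this toy input the first two documents share every tag path and differ only in text and in the number of repeated `<p>` elements, so they fall into one group, while the table-based page stays separate. A set of paths ignores how often a path occurs; a measure that counts occurrences or uses tree edit distance would behave differently.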
