Hierarchical Web-Page Clustering via In-Page and Cross-Page Link Structures

Despite the wide diversity of web pages, those residing within a particular organization are, in most cases, organized into semantically hierarchical structures. For example, the website of a computer science department contains pages about its people, courses, and research; the people pages are further categorized into faculty, staff, and students, and the research pages branch into different areas. Uncovering such hierarchical structures could give users a convenient means of comprehensive navigation and accelerate other web mining tasks. In this study, we extract a similarity matrix among pages from in-page and cross-page link structures. Based on this matrix, we develop a density-based clustering algorithm that hierarchically groups densely linked web pages into semantic clusters. Our experiments show that this method is efficient and effective, and that it sheds light on mining and exploring web structures.
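As a rough illustration of the pipeline described above (link structure, then pairwise similarity, then hierarchical grouping), the following minimal sketch may help. It is not the authors' algorithm: the Jaccard overlap of link neighborhoods stands in for the similarity matrix, and connected components at decreasing similarity thresholds stand in for the density-based step. The function names, the toy `links` site, and the thresholds are all assumptions made for this example.

```python
from itertools import combinations


def neighborhoods(links):
    """Map each page to itself plus every page it links to or is linked from."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hood = {p: {p} for p in pages}
    for src, targets in links.items():
        for dst in targets:
            hood[src].add(dst)
            hood[dst].add(src)
    return hood


def similarity(hood):
    """Jaccard overlap of link neighborhoods for every pair of pages (assumed measure)."""
    return {
        frozenset((a, b)): len(hood[a] & hood[b]) / len(hood[a] | hood[b])
        for a, b in combinations(sorted(hood), 2)
    }


def clusters_at(pages, sim, threshold):
    """Group pages joined by chains of pairs whose similarity is >= threshold."""
    parent = {p: p for p in pages}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for pair, s in sim.items():
        if s >= threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = {}
    for p in pages:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


def hierarchy(links, thresholds=(0.7, 0.3)):
    """Return the clusters at each threshold; lower thresholds give coarser levels."""
    hood = neighborhoods(links)
    sim = similarity(hood)
    return {t: clusters_at(sorted(hood), sim, t) for t in thresholds}


# Hypothetical department site: each page maps to the pages it links to.
links = {
    "home": ["people", "research"],
    "people": ["alice", "bob"],
    "alice": ["bob", "people"],
    "bob": ["alice", "people"],
    "research": ["ai", "db"],
    "ai": ["db", "research"],
    "db": ["ai", "research"],
}
levels = hierarchy(links)
for threshold, clusters in levels.items():
    print(threshold, clusters)
```

On this toy site, the 0.7 level separates the people pages from the research pages and leaves the home page on its own, while the 0.3 level merges everything into a single cluster, mirroring the kind of hierarchy the abstract describes.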
