A matrix approach for hierarchical web page clustering based on hyperlinks

This paper proposes a matrix approach to hierarchical web page clustering, with two algorithms that use the hyperlink information among pages. The first algorithm clusters web pages without considering cluster overlap; the second takes cluster overlap into account. Both algorithms exploit intrinsic relationships among the pages and are independent of the order in which the pages are presented to them. Furthermore, the proposed algorithms do not require a predefined similarity threshold for clustering, and they are easy to implement for web applications. Preliminary evaluations show the effectiveness of the proposed algorithms and indicate a promising application.
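
To make the matrix view and the order-independence claim concrete, here is a minimal sketch. It is not the paper's algorithm: the abstract does not define its similarity measure, so the sketch assumes a plain co-citation plus bibliographic-coupling count. The page names, the link list, and the helper names `adjacency`, `shared_link_counts`, and `as_pair_dict` are all hypothetical. The sketch only shows that a similarity derived from a hyperlink matrix gives the same value for each pair of pages regardless of the order in which the pages are listed.

```python
def adjacency(pages, links):
    """Build a 0/1 hyperlink matrix: a[i][j] = 1 if pages[i] links to pages[j]."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    a = [[0] * n for _ in range(n)]
    for src, dst in links:
        a[index[src]][index[dst]] = 1
    return a


def shared_link_counts(a):
    """Assumed similarity (not taken from the paper): for pages i and j,
    count the pages both link to plus the pages that link to both.
    In matrix terms this is A A^T + A^T A."""
    n = len(a)
    s = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out_common = sum(a[i][k] * a[j][k] for k in range(n))
            in_common = sum(a[k][i] * a[k][j] for k in range(n))
            s[i][j] = out_common + in_common
    return s


def as_pair_dict(pages, s):
    """Re-key the matrix by unordered page pair so that results computed
    under different page orderings can be compared directly."""
    return {
        frozenset((p, q)): s[i][j]
        for i, p in enumerate(pages)
        for j, q in enumerate(pages)
        if i < j
    }


# Hypothetical toy web graph: (source page, target page).
links = [("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("c", "d")]

pages = ["a", "b", "c", "d"]
reordered = ["d", "c", "b", "a"]  # same pages, presented in a different order

scores = as_pair_dict(pages, shared_link_counts(adjacency(pages, links)))
scores_reordered = as_pair_dict(
    reordered, shared_link_counts(adjacency(reordered, links))
)

# Reordering the pages permutes rows and columns but leaves each pair's score unchanged.
print(scores == scores_reordered)
```

In this toy graph, pages `a` and `b` both link to `c` and `d`, so their assumed similarity is 2 under either ordering. A hierarchical clustering step built on such a matrix would inherit this order independence, provided that ties are broken deterministically.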
