THESUS, a closer view on Web content management enhanced with link semantics

With the unstoppable growth of the World Wide Web and the great success of Web search engines such as Google and AltaVista, users now turn to the Web whenever they look for information. Many of these users are neophytes in computer science, yet they are often specialists in a particular domain. They would like to use more semantics to guide their search through Web material, whereas most current search features rely on raw lexical content. We show how the incoming links of a page can be used efficiently to classify the page in a concise manner, which enhances the browsing and querying of Web pages. We focus on the tools needed to manage the links and their semantics, and we further process these links using a hierarchy of concepts, akin to an ontology, and a thesaurus. This work is demonstrated by a prototype system, called THESUS, that organizes thematic Web documents into semantic clusters. Our contributions are the following: 1) a model and language to exploit link semantics information, 2) the THESUS prototype system, 3) its innovative aspects and algorithms, in particular a novel similarity measure between Web documents applied to different clustering schemes (DBSCAN and COBWEB), and 4) a thorough experimental evaluation demonstrating the value of our approach.
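The abstract does not spell out the similarity measure, so the following is only a minimal sketch of how a hierarchy-based similarity between documents might look. It assumes each page has already been mapped to a set of concepts derived from its incoming links, and it swaps in two stand-in techniques that are assumptions here, not the authors' confirmed method: Wu-Palmer similarity between concepts in the hierarchy, and a symmetric best-match average between concept sets. The hierarchy, page names, and function names are all hypothetical.

```python
# Toy concept hierarchy (child -> parent). Hypothetical example data.
PARENT = {
    "science": None,
    "computing": "science",
    "databases": "computing",
    "networks": "computing",
    "biology": "science",
}

def path_to_root(concept):
    """Return the list [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def depth(concept):
    """Depth in the hierarchy, with the root at depth 1."""
    return len(path_to_root(concept))

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the deepest common ancestor.
    for c in path_to_root(b):
        if c in ancestors_a:
            return 2 * depth(c) / (depth(a) + depth(b))
    return 0.0

def set_similarity(xs, ys):
    """Symmetric average of each concept's best match in the other set
    (an assumed set measure, chosen for illustration)."""
    if not xs or not ys:
        return 0.0
    left = sum(max(wu_palmer(x, y) for y in ys) for x in xs) / len(xs)
    right = sum(max(wu_palmer(x, y) for x in xs) for y in ys) / len(ys)
    return (left + right) / 2

# Hypothetical pages, each described by concepts taken from its incoming links.
page_concepts = {
    "p1": {"databases"},
    "p2": {"databases", "networks"},
    "p3": {"biology"},
}

if __name__ == "__main__":
    for a, b in [("p1", "p2"), ("p1", "p3")]:
        print(a, b, round(set_similarity(page_concepts[a], page_concepts[b]), 3))
```

Under these assumptions, pages whose link concepts sit close together in the hierarchy score higher than pages that only share the root. A density-based scheme such as DBSCAN could then work on the pairwise distances `1 - set_similarity(...)`.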
