A Review of Web Document Clustering Approaches

The Internet has become the largest data repository in existence, and its users face the problem of information overload. The web search environment, however, is far from ideal: the sheer abundance of information, combined with the dynamic and heterogeneous nature of the Web, makes information retrieval a difficult process for the average user. There is therefore a clear need for techniques that help users organize and browse the available information effectively, with the ultimate goal of satisfying their information needs. Cluster analysis, which deals with the organization of a collection of objects into cohesive groups, can play a very important role in achieving this objective. In this chapter, we present an exhaustive survey of the web document clustering approaches available in the literature, classified into three main categories: text-based, link-based, and hybrid. We also present a thorough comparison of the algorithms based on the various facets of their features and functionality. Finally, based on this review, we conclude that although clustering has been a topic of study in the scientific community for three decades, many open issues remain that call for further research.
