An Overview of Web Data Clustering Practices

Clustering is a challenging topic in the area of Web data management. Various forms of clustering are required in a wide range of applications, including finding mirrored Web pages, detecting copyright violations, and reporting search results in a structured way. Clustering can be performed either once offline (independently of search queries) or online (on the results of search queries). Important efforts have focused on mining Web access logs and on clustering search engine results on the fly. Online methods based on link structure and text have been applied successfully to finding pages on related topics. This paper presents an overview of the most popular methodologies and implementations for clustering either Web users or Web sources, and surveys the current status and future trends of clustering on the Web.
