Web mining: information and pattern discovery on the World Wide Web

Application of data mining techniques to the World Wide Web, referred to as Web mining, has been the focus of several recent research projects and papers. However, there is no established vocabulary, leading to confusion when comparing research efforts. The term Web mining has been used in two distinct ways. The first, called Web content mining in this paper, is the process of information discovery from sources across the World Wide Web. The second, called Web usage mining, is the process of mining for user browsing and access patterns. We define Web mining and present an overview of the various research issues, techniques, and development efforts. We briefly describe WEBMINER, a system for Web usage mining, and conclude the paper by listing research issues.

[1]  Casimir A. Kulikowski,et al.  Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems , 1990 .

[2]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[3]  L. R. Rasmussen,et al.  In information retrieval: data structures and algorithms , 1992 .

[4]  Jiawei Han,et al.  Data-Driven Discovery of Quantitative Rules in Relational Databases , 1993, IEEE Trans. Knowl. Data Eng..

[5]  Jennifer Widom,et al.  The TSIMMIS Project: Integration of Heterogeneous Information Sources , 1994, IPSJ.

[6]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[7]  Krishna Bharat,et al.  WEBVIZ: A Tool for World Wide Web Access Log Analysis , 1994 .

[8]  John Riedl,et al.  GroupLens: an open architecture for collaborative filtering of netnews , 1994, CSCW '94.

[9]  R. Ng,et al.  Eecient and Eeective Clustering Methods for Spatial Data Mining , 1994 .

[10]  R. Agarwal Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.

[11]  Martijn Koster,et al.  ALIWEB - Archie-like Indexing in the WEB , 1994, Comput. Networks ISDN Syst..

[12]  Shamkant B. Navathe,et al.  An Efficient Algorithm for Mining Association Rules in Large Databases , 1995, VLDB.

[13]  Peter B. Danzig,et al.  The Harvest Information Discovery and Access System , 1995, Comput. Networks ISDN Syst..

[14]  Kristian J. Hammond,et al.  FAQ finder: a case-based approach to knowledge navigation , 1995, Proceedings the 11th Conference on Artificial Intelligence for Applications.

[15]  Heikki Mannila,et al.  Discovering Frequent Episodes in Sequences , 1995, KDD.

[16]  Jennifer Widom,et al.  Querying Semistructured Heterogeneous Information , 1995, J. Syst. Integr..

[17]  Pattie Maes,et al.  Social information filtering: algorithms for automating “word of mouth” , 1995, CHI '95.

[18]  Oren Etzioni,et al.  Category Translation: Learning to Understand Information on the Internet , 1995, IJCAI.

[19]  Thorsten Joachims,et al.  WebWatcher : A Learning Apprentice for the World Wide Web , 1995 .

[20]  Divesh Srivastava,et al.  The Information Manifold , 1995 .

[21]  David Konopnicki,et al.  W3QS: A Query System for the World-Wide Web , 1995, VLDB.

[22]  Jiawei Han,et al.  Resource and Knowledge Discovery in Global Information Systems: A Preliminary Design and Experiment , 1995, KDD.

[23]  Jaideep Srivastava,et al.  Myriad: Design and implementation of a federated database prototype , 1995, Softw. Pract. Exp..

[24]  Jeffrey F. Naughton,et al.  On the Computation of Multidimensional Aggregates , 1996, VLDB.

[25]  Michael J. Pazzani,et al.  Syskill & Webert: Identifying Interesting Web Sites , 1996, AAAI/IAAI, Vol. 1.

[26]  Jeffrey F. Naughton,et al.  Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies , 1996, VLDB.

[27]  Daniel S. Weld,et al.  Planning to gather inforrnation , 1996, AAAI 1996.

[28]  Ramana Rao,et al.  Silk from a sow's ear: extracting usable structures from the Web , 1996, CHI.

[29]  Israel Ben-Shaul,et al.  Automatically Organizing Bookmarks per Contents , 1996, Comput. Networks.

[30]  Dan Suciu,et al.  A query language and optimization techniques for unstructured data , 1996, SIGMOD '96.

[31]  Jaideep Srivastava,et al.  Web Mining: Pattern Discovery from World Wide Web Transactions , 1996 .

[32]  Venky Harinarayan,et al.  Implementing Data Cubes E ciently , 1996 .

[33]  Ramakrishnan Srikant,et al.  Mining Sequential Patterns: Generalizations and Performance Improvements , 1996, EDBT.

[34]  Roger L. King,et al.  Supporting Information Infrastructure for Distributed, Heterogeneous Knowledge Discovery , 1996 .

[35]  Chanathip Namprempre,et al.  HyPursuit: a hierarchical network search engine that exploits content-link hypertext clustering , 1996, HYPERTEXT '96.

[36]  Daniel S. Weld,et al.  Planning to Gather Information , 1996, AAAI/IAAI, Vol. 1.

[37]  Laks V. S. Lakshmanan,et al.  A declarative language for querying and restructuring the Web , 1996, Proceedings RIDE '96. Sixth International Workshop on Research Issues in Data Engineering.

[38]  Philip S. Yu,et al.  Data mining for path traversal patterns in a web environment , 1996, Proceedings of 16th International Conference on Distributed Computing Systems.

[39]  Jeffrey D. Ullman,et al.  Implementing data cubes efficiently , 1996, SIGMOD '96.

[40]  Jaideep Srivastava,et al.  Grouping Web page references into transactions for mining World Wide Web browsing patterns , 1997, Proceedings 1997 IEEE Knowledge and Data Engineering Exchange Workshop.

[41]  James E. Pitkow,et al.  In Search of Reliable Usage Data on the WWW , 1997, Comput. Networks.

[42]  Chia-Hui Chang,et al.  Customizable Multi-Engine Search Tool with Clustering , 1997, Comput. Networks.

[43]  Yoav Shoham,et al.  An Adaptive Agent for Automated Web Browsing , 1997 .

[44]  Oren Etzioni,et al.  A scalable comparison-shopping agent for the World-Wide Web , 1997, AGENTS '97.

[45]  William F. Punch,et al.  Finding Salient Features for Personal Web Page Categories , 1997, Comput. Networks.

[46]  Paolo Merialdo,et al.  Semistructured and structured data in the Web: going back and forth , 1997, SGMD.

[47]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[48]  Curtis E. Dyreson Using an Incomplete Data Cube as a Summary Data Sieve , 1997, IEEE Data Eng. Bull..

[49]  Ellen Spertus,et al.  ParaSite: Mining Structural Information on the Web , 1997, Comput. Networks.

[50]  David Konopnicki,et al.  Information gathering in the World-Wide Web: the W3QL query language and the W3QS system , 1998, TODS.