Web usage mining based on probabilistic latent semantic analysis

The primary goal of Web usage mining is the discovery of patterns in the navigational behavior of Web users. Standard approaches, such as clustering of user sessions and discovering association rules or frequent navigational paths, do not generally provide the ability to automatically characterize or quantify the unobservable factors that lead to common navigational patterns. It is, therefore, necessary to develop techniques that can automatically discover hidden semantic relationships among users as well as between users and Web objects. Probabilistic Latent Semantic Analysis (PLSA) is particularly useful in this context, since it can uncover latent semantic associations among users and pages based on the co-occurrence patterns of these pages in user sessions. In this paper, we develop a unified framework for the discovery and analysis of Web navigational patterns based on PLSA. We show the flexibility of this framework in characterizing various relationships among users and Web objects. Since these relationships are measured in terms of probabilities, we are able to use probabilistic inference to perform a variety of analysis tasks such as user segmentation, page classification, as well as predictive tasks such as collaborative recommendations. We demonstrate the effectiveness of our approach through experiments performed on real-world data sets.

[1]  Jaideep Srivastava,et al.  Data Preparation for Mining World Wide Web Browsing Patterns , 1999, Knowledge and Information Systems.

[2]  Tao Luo,et al.  Discovery and Evaluation of Aggregate Usage Profiles for Web Personalization , 2004, Data Mining and Knowledge Discovery.

[3]  Myra Spiliopoulou,et al.  A Framework for the Evaluation of Session Reconstruction Heuristics in Web-Usage Analysis , 2003, INFORMS J. Comput..

[4]  David Cohn,et al.  Learning to Probabilistically Identify Authoritative Documents , 2000, ICML.

[5]  Andrew E. Fano,et al.  Building Recommender Systems using a Knowledge Base of Product Semantics , 2002 .

[6]  Byoung-Tak Zhang,et al.  An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition , 2003, PAKDD.

[7]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[8]  Thorsten Brants,et al.  Topic-based document segmentation with probabilistic latent semantic analysis , 2002, CIKM '02.

[9]  Kris Popat,et al.  A Hierarchical Model for Clustering and Categorising Documents , 2002, ECIR.

[10]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[11]  Jaideep Srivastava,et al.  Automatic personalization based on Web usage mining , 2000, CACM.

[12]  Pedro M. Domingos,et al.  Relational Markov models and their application to adaptive web navigation , 2002, KDD.

[13]  Tao Luo,et al.  Integrating Web Usage and Content Mining for More Effective Personalization , 2000, EC-Web.

[14]  Jaideep Srivastava,et al.  Web mining: information and pattern discovery on the World Wide Web , 1997, Proceedings Ninth IEEE International Conference on Tools with Artificial Intelligence.

[15]  Rajesh Parekh,et al.  Lessons and Challenges from Mining Retail E-Commerce Data , 2004, Machine Learning.

[16]  Jaideep Srivastava,et al.  Web usage mining: discovery and applications of usage patterns from Web data , 2000, SKDD.

[17]  Ramesh R. Sarukkai,et al.  Link prediction and path analysis using Markov chains , 2000, Comput. Networks.

[18]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[19]  Georgios Paliouras,et al.  Web Usage Mining as a Tool for Personalization: A Survey , 2003, User Modeling and User-Adapted Interaction.

[20]  Jaideep Srivastava,et al.  Creating adaptive Web sites through usage-based clustering of URLs , 1999, Proceedings 1999 Workshop on Knowledge and Data Engineering Exchange (KDEX'99) (Cat. No.PR00453).

[21]  Michael I. Jordan,et al.  Unsupervised Learning from Dyadic Data , 1998 .

[22]  Thorsten Brants,et al.  Finding Similar Documents in Document Collections , 2002 .

[23]  Ramakrishnan Srikant,et al.  Mining web logs to improve website organization , 2001, WWW '01.

[24]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[25]  Peter Pirolli,et al.  Mining Longest Repeating Subsequences to Predict World Wide Web Surfing , 1999, USENIX Symposium on Internet Technologies and Systems.

[26]  Oren Etzioni,et al.  Adaptive Web Sites: Automatically Synthesizing Web Pages , 1998, AAAI/IAAI.

[27]  David A. Cohn,et al.  The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity , 2000, NIPS.

[28]  James McKee,et al.  Towards Zero-Input Personalization: Referrer-Based Page Prediction , 2000, AH.

[29]  Anupam Joshi,et al.  Automatic Web User Profiling and Personalization Using Robust Fuzzy Relational Clustering , 2002 .

[30]  Richard A. Harshman,et al.  Indexing by latent semantic indexing , 1990 .

[31]  Michael D. Smith,et al.  Using Path Profiles to Predict HTTP Requests , 1998, Comput. Networks.

[32]  Myra Spiliopoulou,et al.  Web usage mining for Web site evaluation , 2000, CACM.

[33]  Bamshad Mobasher,et al.  Using Ontologies to Discover Domain-Level Web Usage Profiles , 2002 .

[34]  Susan T. Dumais,et al.  Using Linear Algebra for Intelligent Information Retrieval , 1995, SIAM Rev..