Using probabilistic latent semantic analysis for Web page grouping

The locality of Web pages within a Web site is initially determined by the designer's expectations. Web usage mining can discover patterns in the navigational behaviour of Web visitors and, in turn, improve Web site functionality and service design by taking users' actual opinions into account. Conventional Web page clustering techniques are often used to reveal the functional similarity of Web pages. However, treating each user transaction as a dimension leads to a high-dimensional computation problem. In this paper, we propose a new Web page grouping approach based on a probabilistic latent semantic analysis (PLSA) model. An iterative algorithm based on the maximum likelihood principle is employed to overcome this computational shortcoming. The Web pages are classified into groups according to user access patterns. Meanwhile, the latent semantic factors, or tasks, are characterized by extracting the content of the "dominant" pages related to each factor. We demonstrate the effectiveness of our approach through experiments on real-world data sets.
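
The iterative maximum likelihood fit mentioned above can be illustrated with a minimal sketch. It assumes the standard PLSA aspect model, P(s, p) = sum over z of P(z) P(s|z) P(p|z), fitted by expectation-maximization on a session-by-page count matrix. The function names (`plsa`, `group_pages`, `log_likelihood`), the random initialization, and the rule of assigning each page to its most probable factor are illustrative assumptions, not the procedure used in the paper.

```python
import math
import random


def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(counts, n_factors, n_iter=100, seed=0):
    """Fit P(z), P(s|z), P(p|z) by EM; counts[s][p] = visits of page p in session s."""
    rng = random.Random(seed)
    n_sessions, n_pages = len(counts), len(counts[0])
    p_z = [1.0 / n_factors] * n_factors
    p_s_z = [normalize([rng.random() for _ in range(n_sessions)]) for _ in range(n_factors)]
    p_p_z = [normalize([rng.random() for _ in range(n_pages)]) for _ in range(n_factors)]
    for _ in range(n_iter):
        new_z = [0.0] * n_factors
        new_s = [[0.0] * n_sessions for _ in range(n_factors)]
        new_p = [[0.0] * n_pages for _ in range(n_factors)]
        for s in range(n_sessions):
            for p in range(n_pages):
                n = counts[s][p]
                if n == 0:
                    continue
                # E-step: posterior P(z | s, p) for this observed (session, page) pair.
                post = normalize([p_z[z] * p_s_z[z][s] * p_p_z[z][p] for z in range(n_factors)])
                # Accumulate expected counts for the M-step.
                for z in range(n_factors):
                    w = n * post[z]
                    new_z[z] += w
                    new_s[z][s] += w
                    new_p[z][p] += w
        # M-step: re-estimate each distribution from the expected counts.
        p_z = normalize(new_z)
        p_s_z = [normalize(row) for row in new_s]
        p_p_z = [normalize(row) for row in new_p]
    return p_z, p_s_z, p_p_z


def log_likelihood(counts, p_z, p_s_z, p_p_z):
    """Log-likelihood of the observed counts under the fitted model."""
    total = 0.0
    for s, row in enumerate(counts):
        for p, n in enumerate(row):
            if n > 0:
                prob = sum(p_z[z] * p_s_z[z][s] * p_p_z[z][p] for z in range(len(p_z)))
                total += n * math.log(prob)
    return total


def group_pages(p_z, p_p_z):
    """Assign each page to the factor z maximizing P(z) * P(p|z) (an assumed rule)."""
    n_pages = len(p_p_z[0])
    return [max(range(len(p_z)), key=lambda z: p_z[z] * p_p_z[z][p]) for p in range(n_pages)]
```

Calling `plsa(counts, n_factors=2)` on a small session-by-page matrix and passing the resulting `p_z` and `p_p_z` to `group_pages` returns one factor index per page. The pages with the largest P(p|z) for a given factor would play the role of that factor's "dominant" pages. The per-iteration cost is proportional to the number of non-zero session-page pairs times the number of factors, so the sketch never builds pairwise distances between pages in session space.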
