A Page-Classification Approach to Web Usage Semantic Analysis

With the emergence of the World Wide Web, analyzing and improving Web communication has become essential to adapt Web content to visitors’ expectations. Web communication analysis is traditionally performed by Web analytics software, which produces long lists of page-based audience metrics. These results suffer from page synonymy, page polysemy, page temporality, and page volatility. In addition, the metrics carry little semantics and are too detailed to be exploited by organization managers and chief editors, who need summarized, conceptual information to make high-level decisions. To obtain summarized, conceptual metrics, we propose to classify the Web site pages into categories representing the Web site topics and to aggregate the page hits accordingly. In this paper, we show how to compute and visualize these metrics using OLAP tools. To solve the page-temporality issue, we propose to classify the versions of the pages using support vector machines. To validate our approach, we perform experiments on real data with SQL Server OLAP Analysis Service, the R statistical tool, and our prototype WASA-PC. Finally, we compare our results against directory-based metrics and concept-based metrics.
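The aggregation step described above, in which page hits are rolled up into topic-level metrics once each page has a category, can be illustrated with a minimal Python sketch. It is not the WASA-PC implementation: the function name `aggregate_hits_by_topic`, the sample URLs, the topic labels, and the `unclassified` fallback are all assumptions made for illustration, and the page-to-topic mapping is taken as given rather than produced by a support vector machine.

```python
from collections import Counter


def aggregate_hits_by_topic(page_hits, page_topic, unknown="unclassified"):
    """Sum page-level hit counts into topic-level totals.

    page_hits:  dict mapping a page URL to its hit count.
    page_topic: dict mapping a page URL to its topic label
                (assumed here to come from a prior classification step).
    Pages without a label are grouped under `unknown`.
    """
    totals = Counter()
    for url, hits in page_hits.items():
        totals[page_topic.get(url, unknown)] += hits
    return dict(totals)


# Hypothetical sample data, for illustration only.
page_hits = {
    "/courses/db.html": 120,
    "/courses/ai.html": 80,
    "/research/olap.html": 45,
    "/tmp/old.html": 5,
}
page_topic = {
    "/courses/db.html": "Teaching",
    "/courses/ai.html": "Teaching",
    "/research/olap.html": "Research",
}

topic_hits = aggregate_hits_by_topic(page_hits, page_topic)
print(topic_hits)
```

In an OLAP setting, the same roll-up would correspond to aggregating a hits measure along a page dimension whose upper level is the topic; the dictionary here stands in for that hierarchy.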
