Identifying user sessions from web server logs with integer programming

Web usage mining has proven to be an important advance for e-business systems, both by finding web user buying patterns and by suggesting ways to improve web user navigation. A primary input for web usage mining is web user sessions. When sessions are not otherwise identified, they must be constructed from web server logs, a process called sessionization. We use bipartite cardinality matching and a more general integer program to construct sessions. We also propose several variations of our integer program to provide additional insights into session characteristics. For testing, we retrieve 15 months of web server logs and the corresponding real sessions from an academic web site. We compare the real sessions with the results obtained by our optimization models and with the results of a commonly used timeout heuristic. We find that our optimization models dominate the timeout heuristic on several comparison measures. Solution time for a typical month is seven hours for our integer program, 30 minutes for our bipartite cardinality matching, and about one minute for the heuristic. Although solution time is significantly greater for the integer program, its variations contribute additional analysis of web user behavior.
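
For orientation, the timeout heuristic used as the baseline can be sketched roughly as follows. This is a minimal illustration, not the implementation evaluated here: it assumes a gap-based variant of the heuristic, a 30-minute threshold, visitors identified by the pair (IP address, user agent), and a hypothetical record layout `(ip, user_agent, timestamp, url)`. None of these details are specified in the abstract.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize_by_timeout(records, timeout=timedelta(minutes=30)):
    """Group log records into sessions with a gap-based timeout heuristic.

    records: iterable of (ip, user_agent, timestamp, url) tuples.
    The record layout and the 30-minute default are assumptions made
    for this sketch, not values taken from the paper.
    """
    # Treat each distinct (ip, user_agent) pair as one visitor.
    by_visitor = defaultdict(list)
    for ip, agent, ts, url in records:
        by_visitor[(ip, agent)].append((ts, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort(key=lambda h: h[0])  # order requests by timestamp
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            # A gap longer than the timeout closes the current session.
            if hit[0] - prev[0] > timeout:
                sessions.append((visitor, current))
                current = []
            current.append(hit)
        sessions.append((visitor, current))
    return sessions


# Hypothetical log records, invented purely for illustration.
example_records = [
    ("10.0.0.1", "UA-1", datetime(2024, 1, 1, 9, 0), "/index"),
    ("10.0.0.1", "UA-1", datetime(2024, 1, 1, 9, 5), "/courses"),
    ("10.0.0.1", "UA-1", datetime(2024, 1, 1, 10, 0), "/index"),
    ("10.0.0.2", "UA-2", datetime(2024, 1, 1, 9, 1), "/index"),
]

example_sessions = sessionize_by_timeout(example_records)
for visitor, hits in example_sessions:
    print(visitor, [url for _, url in hits])
```

A heuristic of this kind looks only at timestamps within one visitor's requests, which is why it runs quickly. The optimization models described above instead choose the assignment of requests to sessions jointly, at a much higher solution time.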