Exploiting Clustering Techniques for Web User-session Inference

We focus on the definition and identification of “Web user-sessions”, where a session is an aggregation of several TCP connections generated by the same source host, grouped on the basis of TCP connection opening times. Identifying a user session is nontrivial: traditional approaches rely on threshold-based mechanisms, which are very sensitive to the chosen threshold value, and that value can be difficult to set correctly. By applying clustering techniques, we define a novel methodology that identifies Web user-sessions without requiring an a priori definition of threshold values. We discuss the pros and cons of this approach and define a methodology that can be applied to real traffic traces. The proposed methodology is evaluated on artificially generated traces to show its benefits over traditional threshold-based approaches. We then analyze the user sessions extracted from real traces and study their statistical properties.
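
To make the threshold sensitivity concrete, the following minimal sketch shows one common form of threshold-based session identification: a new session starts whenever the gap between consecutive connection opening times from the same host exceeds a fixed threshold. This illustrates the traditional approach only, not the clustering methodology proposed here. The function name `split_sessions`, the sample opening times, and the threshold values are all hypothetical and chosen for illustration.

```python
from typing import List


def split_sessions(open_times: List[float], threshold: float) -> List[List[float]]:
    """Group TCP connection opening times (in seconds) from one source host.

    A new session begins whenever the gap to the previous connection
    exceeds `threshold`. Hypothetical helper, for illustration only.
    """
    sessions: List[List[float]] = []
    for t in sorted(open_times):
        if sessions and t - sessions[-1][-1] <= threshold:
            # Gap is within the threshold: same session as the previous connection.
            sessions[-1].append(t)
        else:
            # First connection, or gap too large: start a new session.
            sessions.append([t])
    return sessions


# Hypothetical opening times for a single host (seconds since trace start).
times = [0.0, 1.0, 3.0, 40.0, 42.0, 45.0, 300.0, 302.0]

# The same trace yields a different number of sessions for each threshold.
for threshold in (10.0, 60.0, 600.0):
    print(threshold, len(split_sessions(times, threshold)))
```

On this sample the number of identified sessions depends entirely on the threshold, which is the sensitivity that motivates avoiding an a priori threshold choice.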
