Detecting user actions from HTTP traces: Toward an automatic approach

Detecting explicit user actions, i.e., requests for web pages such as hyperlink clicks, from passive traces is fundamental for many applications, such as network forensics and content popularity estimation. Every URL explicitly visited by a user usually triggers further automatic URL requests to fetch the objects that compose the web page. HTTP traces summarize all URLs requested by users, but they do not indicate which requests are explicit and which are automatic. Previous works have targeted this problem with ad-hoc heuristics, typically validated using synthetic traces. This paper investigates whether an approach based solely on machine learning can successfully detect user actions from HTTP traces. A machine learning approach would bring several advantages: for example, it minimizes manual tuning of parameters and can easily adapt to changes in page structure. We build both real and synthetic traces to assess performance and to gain insight into the features that contribute most to classification. Our results show that machine learning performs as well as or better than previous heuristics. Furthermore, we show that models built with machine learning algorithms are robust, presenting consistent performance across different scenarios.
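To make the classification task concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a hypothetical `Request` record and three illustrative features (time gap since the client's previous request, whether the response is HTML, whether a referer is present), and it trains a one-level decision tree (a decision stump) on a tiny hand-labeled toy trace. The feature set, the toy data, and all names are assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One HTTP trace entry (hypothetical schema, assumed for this sketch)."""
    client: str        # client identifier, e.g. an anonymized IP address
    timestamp: float   # request time in seconds
    url: str
    content_type: str
    referer: str       # empty string when the header is absent


def extract_features(trace):
    """Map a time-ordered trace to [gap, is_html, has_referer] vectors."""
    last_seen = {}
    features = []
    for req in trace:
        prev = last_seen.get(req.client)
        # No previous request from this client: treat the gap as infinite.
        gap = req.timestamp - prev if prev is not None else float("inf")
        last_seen[req.client] = req.timestamp
        is_html = 1.0 if req.content_type.startswith("text/html") else 0.0
        has_referer = 1.0 if req.referer else 0.0
        features.append([gap, is_html, has_referer])
    return features


def train_stump(X, y):
    """Return the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (True, False):
                preds = [(row[j] >= t) == polarity for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]


def predict(stump, row):
    """Classify one feature vector: True means 'explicit user action'."""
    j, t, polarity = stump
    return (row[j] >= t) == polarity


# Toy labeled trace: True marks a request assumed to be an explicit user action.
trace = [
    Request("A", 0.0, "/index.html", "text/html", ""),
    Request("A", 0.1, "/style.css", "text/css", "/index.html"),
    Request("A", 0.2, "/logo.png", "image/png", "/index.html"),
    Request("A", 0.3, "/frame.html", "text/html", "/index.html"),
    Request("A", 12.0, "/news.html", "text/html", "/index.html"),
    Request("A", 12.1, "/news.js", "application/javascript", "/news.html"),
    Request("A", 12.15, "/ad.html", "text/html", "/news.html"),
]
labels = [True, False, False, False, True, False, False]

X = extract_features(trace)
stump = train_stump(X, labels)
predictions = [predict(stump, row) for row in X]
```

On this toy data the stump splits on the time gap, since embedded HTML objects (frames, ads) make the content type alone ambiguous. A real trace would need a richer classifier and feature set than this sketch provides.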
