Can web pages be classified using anonymized TCP/IP headers?

Web page classification is useful in many domains, including ad targeting, traffic modeling, and intrusion detection. In this paper, we investigate whether learning-based techniques can classify web pages using only the anonymized TCP/IP headers of the traffic generated when a web page is visited. We do this in three steps. First, we select informative TCP/IP features for a given downloaded web page, and study which of these remain stable over time and consistent across client browser platforms. Second, we use the selected features to evaluate four different labeling schemes and learning-based classification methods for web page classification. Finally, we empirically study the effectiveness of the classification methods for real-world applications.
