Natural language processing for web browsing analytics: Challenges, lessons learned, and opportunities

Abstract: In an Internet arena where the revenues of search engines and other digital marketing firms peak, other actors still have opportunities to monetize their users' data. After appropriate anonymization, aggregation, and agreement, the set of websites that users visit can become exploitable data for ISPs. Uses range from assessing the scope of advertising campaigns to reinforcing user loyalty, among other marketing approaches, as well as security applications. However, sniffers based on HTTP, DNS, TLS, or flow features do not suffice for this task. Modern websites preload and prefetch content, and they embed banners, social network links, images, and scripts from other websites. This self-triggered traffic makes it hard to assess which websites users visited on purpose. Moreover, DNS caches prevent some queries for actively visited websites from ever being sent. Given this limited input, we propose to treat domains as words and sequences of domains as documents. This translates the identification of visited websites into a text classification problem, to which the most promising techniques from natural language processing and neural networks can be applied. After applying different representation methods such as TF-IDF, Word2vec, Doc2vec, and custom neural networks in diverse scenarios and with several datasets, we identify websites visited on purpose with accuracy above 90%, with peaks close to 100%, through processes that are fully automated and free of any human parametrization.
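The idea of treating domains as words and domain sequences as documents can be illustrated with a minimal sketch. It is not the authors' pipeline: the domain names and training samples are hypothetical, and the nearest-centroid classifier is a simple stand-in chosen for brevity in place of the Word2vec, Doc2vec, and neural network models the abstract mentions. Only the TF-IDF representation is taken from the text.

```python
import math
from collections import Counter

# Hypothetical training data: each "document" is the sequence of domains
# observed around one visit, labeled with the website visited on purpose.
TRAIN = [
    (["news.example", "cdn.example", "ads.tracker.example", "fonts.example"], "news.example"),
    (["news.example", "img.news.example", "ads.tracker.example"], "news.example"),
    (["shop.example", "pay.example", "cdn.example", "ads.tracker.example"], "shop.example"),
    (["shop.example", "img.shop.example", "pay.example"], "shop.example"),
]


def fit_idf(docs):
    """Smoothed inverse document frequency for every domain ("word")."""
    n = len(docs)
    df = Counter(domain for doc in docs for domain in set(doc))
    return {domain: math.log((1 + n) / (1 + count)) + 1.0 for domain, count in df.items()}


def tfidf(doc, idf):
    """L2-normalized TF-IDF vector of a domain sequence; unseen domains are ignored."""
    tf = Counter(domain for domain in doc if domain in idf)
    vec = {domain: count * idf[domain] for domain, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {domain: w / norm for domain, w in vec.items()} if norm else {}


def train(samples):
    """Build one normalized centroid per label (assumed classifier, not from the paper)."""
    idf = fit_idf([doc for doc, _ in samples])
    sums = {}
    for doc, label in samples:
        acc = sums.setdefault(label, Counter())
        for domain, w in tfidf(doc, idf).items():
            acc[domain] += w
    centroids = {}
    for label, acc in sums.items():
        norm = math.sqrt(sum(w * w for w in acc.values()))
        centroids[label] = {domain: w / norm for domain, w in acc.items()}
    return idf, centroids


def predict(doc, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity to the document."""
    vec = tfidf(doc, idf)

    def cosine(centroid):
        return sum(w * centroid.get(domain, 0.0) for domain, w in vec.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


if __name__ == "__main__":
    idf, centroids = train(TRAIN)
    # The main domain is absent, as if a DNS cache had answered the query.
    print(predict(["img.news.example", "ads.tracker.example", "fonts.example"], idf, centroids))
```

In this toy setting the query sequence lacks the main domain entirely, yet the accompanying self-triggered domains still point to the intended website, which is the situation the abstract describes for cached DNS queries.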
