Hybrid and lightweight detection of third party tracking: Design, implementation, and evaluation

Abstract Websites commonly rely on services provided by third-party sites to track users and deliver personalized experiences. Unfortunately, this practice has strong implications for both users and performance. On the one hand, the privacy of individuals is at risk, because the collected information can be used to reconstruct personal profiles. On the other hand, many existing privacy countermeasures implemented in Web browsers exhibit performance issues, mainly because they depend on huge lists of privacy-intrusive resources to be filtered out, and these lists are difficult to keep up to date. To overcome these limitations, we propose a hybrid mechanism that combines blacklisting and machine learning to automatically identify privacy-intrusive services requested while browsing Web pages. The idea is to use the blacklisting technique (adopted by the majority of privacy tools) together with a machine learning model that distinguishes between malicious and functional resources and updates the blacklist accordingly. We found that machine learning models can classify JavaScript programs and HTTP requests with accuracy of up to 91% and 97%, respectively. We provide a prototype implementation of this hybrid mechanism, named GuardOne, and we performed an extensive evaluation study to assess its effectiveness and performance. The results show that GuardOne filters out malicious resources from users’ requests without performance degradation when compared with traditional systems that rely on static lists for filtering. Moreover, the effectiveness results show that our mechanism, even if some small improvements are still possible, efficiently filters out malicious requests and substantially reduces personal information leakage.
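The hybrid idea described in the abstract can be sketched roughly as follows. This is a minimal illustration and not GuardOne's actual implementation: the `HybridFilter` class, the `toy_classifier` keyword heuristic (a stand-in for a trained machine learning model), and the example hostnames are all assumptions introduced here. It only shows the control flow: check the blacklist first, fall back to the classifier, and let positive classifications extend the blacklist.

```python
from typing import Callable, Iterable, Set
from urllib.parse import urlparse


class HybridFilter:
    """Blacklist lookup first, classifier fallback, blacklist updated on detection."""

    def __init__(self, blacklist: Iterable[str], classifier: Callable[[str], bool]):
        self.blacklist: Set[str] = set(blacklist)
        self.classifier = classifier  # hypothetical: returns True for tracking requests

    def should_block(self, url: str) -> bool:
        host = urlparse(url).hostname or ""
        # Fast path: the host is already known to be privacy intrusive.
        if host in self.blacklist:
            return True
        # Slow path: ask the classifier, and remember a positive verdict.
        if self.classifier(url):
            if host:
                self.blacklist.add(host)
            return True
        return False


def toy_classifier(url: str) -> bool:
    """Placeholder for a trained model; a keyword check is used only for illustration."""
    return any(token in url for token in ("track", "pixel", "beacon"))


hybrid = HybridFilter({"ads.example.com"}, toy_classifier)
blocked_by_list = hybrid.should_block("https://ads.example.com/banner.js")
blocked_by_model = hybrid.should_block("https://cdn.example.org/pixel.gif?uid=1")
allowed = hybrid.should_block("https://static.example.net/app.js")
```

In this sketch, a host flagged once by the classifier is served from the blacklist on later requests, so the model is consulted only for hosts that have not been seen before.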
