LookAhead: Augmenting Crowdsourced Website Reputation Systems with Predictive Modeling

Unsafe websites include both malicious sites and inappropriate ones, such as those hosting questionable or offensive content. Website reputation systems are intended to help ordinary users steer away from these unsafe sites. However, assigning safety ratings to websites typically involves humans. Consequently, it is time-consuming, costly, and not scalable. This has resulted in two major problems: (i) a significant proportion of the web remains unrated and (ii) there is an unacceptable time lag before new websites are rated. In this paper, we show that by leveraging structural and content-based properties of websites, we can reliably and efficiently predict their safety ratings, thereby mitigating both problems. We demonstrate the effectiveness of our approach using four datasets of up to 90,000 websites. We use ratings from Web of Trust (WOT), a popular crowdsourced web reputation system, as ground truth. We propose a novel ensemble classification technique that makes opportunistic use of available structural and content properties of web pages to predict their eventual ratings in the two dimensions used by WOT: trustworthiness and child safety. Ours is the first classification system to predict such subjective ratings. The same approach works equally well in identifying malicious websites. Across all datasets, our classification achieves an average \(F_1\)-score in the 74–90% range.
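
To make the idea of "opportunistic" use of feature groups concrete, the sketch below combines one base classifier per feature group and skips any group whose features could not be extracted for a given site. It is only an illustration under stated assumptions: the group names (`structural`, `content`), the score averaging rule, the 0.5 threshold, and the function names are all hypothetical and are not taken from the paper, whose actual ensemble may differ. An \(F_1\)-score helper is included because that is the metric reported above.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical type: a base classifier maps one feature group's vector
# to an estimated probability that the site is unsafe.
BaseClassifier = Callable[[Sequence[float]], float]


def opportunistic_predict(
    classifiers: Dict[str, BaseClassifier],
    features: Dict[str, Optional[Sequence[float]]],
    threshold: float = 0.5,  # assumed decision threshold
) -> Optional[bool]:
    """Average the scores of the classifiers whose feature group is available.

    Returns True (unsafe), False (safe), or None when no group is available.
    Averaging is an assumed combination rule, not the paper's method.
    """
    scores = [
        clf(features[name])
        for name, clf in classifiers.items()
        if features.get(name) is not None
    ]
    if not scores:
        return None
    return sum(scores) / len(scores) >= threshold


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for trained models (constant scores, for illustration only).
toy_classifiers: Dict[str, BaseClassifier] = {
    "structural": lambda x: 0.8,
    "content": lambda x: 0.4,
}
```

For example, a page whose content could not be fetched would be passed as `{"structural": [...], "content": None}`, and only the structural classifier would contribute to the decision.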
