Beyond Phish: Toward Detecting Fraudulent e-Commerce Websites at Scale

Despite recent advancements in malicious website detection and phishing mitigation, the security ecosystem has paid little attention to Fraudulent e-Commerce Websites (FCWs), such as fraudulent shopping websites, fake charities, and cryptocurrency scam websites. Even worse, there are no active large-scale mitigation systems or publicly available datasets for FCWs. In this paper, we first propose an efficient and automated approach to gather FCWs through crowdsourcing. We identify eight different types of non-phishing FCWs and derive their key defining characteristics. We then find that anti-phishing mitigation systems, such as Google Safe Browsing, have a detection rate of just 0.46% on our dataset. We create a classifier, BEYOND PHISH, to identify FCWs using manually defined features based on our analysis. Validating BEYOND PHISH on never-before-seen (untrained and untested) data through a user study indicates that our system has a high detection rate of 98.34% and a low false positive rate of 1.34%. Lastly, we collaborated with a major Internet security company, Palo Alto Networks, as well as a major financial services provider, to evaluate our classifier on manually labeled real-world data. The model achieves a detection rate of 94.88% and a false positive rate of 2.46%, showing potential for real-world defense against FCWs.
