PhishStorm: Detecting Phishing With Streaming Analytics

Despite the growth of prevention techniques, phishing remains an important threat because the principal countermeasures in use are still based on reactive URL blacklisting. This technique is inefficient given the short lifetime of phishing Web sites, which makes recent approaches based on real-time or proactive phishing URL detection more appropriate. In this paper, we introduce PhishStorm, an automated phishing detection system that can analyze any URL in real time in order to identify potential phishing sites. PhishStorm can interface with any email server or HTTP proxy. We argue that phishing URLs usually have few relationships between the part of the URL that must be registered (low-level domain) and the remaining part of the URL (upper-level domain, path, query). We show that experimental evidence supports this observation and that it can be used to detect phishing sites. For this purpose, we define the new concept of intra-URL relatedness and evaluate it using features extracted from the words that compose a URL, based on query data from the Google and Yahoo search engines. These features are then used in machine-learning-based classification to detect phishing URLs from a real dataset. Our technique is assessed on 96 018 phishing and legitimate URLs, yielding a correct classification rate of 94.91% with only 1.44% false positives. We also propose an extension, a URL phishingness rating system that exhibits a high confidence rate (> 99%). Finally, we discuss efficient implementation patterns that allow real-time analytics using Big Data architectures such as STORM and advanced data structures based on the Bloom filter.
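As an illustration of the Bloom filter mentioned above, the following is a minimal Python sketch of a generic Bloom filter, not the data structure actually used in PhishStorm. The class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, the use of SHA-256 as the hash function, and the example words are all assumptions made for this sketch.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit position associated with the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if all positions are set; this may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical usage: remember words extracted from URLs seen so far.
seen_words = BloomFilter()
for word in ("paypal", "login", "secure"):
    seen_words.add(word)

print("paypal" in seen_words)  # an added item is always reported as present
```

A membership test on an item that was never added usually returns `False`, but it can return `True` with a small probability that depends on the filter size, the number of hash functions, and the number of stored items. That trade-off of memory against occasional false positives is what makes the structure attractive for high-throughput stream processing.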
