Detecting visually similar Web pages: Application to phishing detection

We propose a novel approach for detecting visual similarity between two Web pages. The proposed approach applies Gestalt theory and considers a Web page as a single indivisible entity. The concept of supersignals, as a realization of Gestalt principles, supports our contention that Web pages must be treated as indivisible entities. We objectify, and directly compare, these indivisible supersignals using algorithmic complexity theory. We illustrate our approach by applying it to the problem of detecting phishing scams. Via a large-scale, real-world case study, we demonstrate that 1) our approach effectively detects similar Web pages; and 2) it accurately distinguishes legitimate pages from phishing pages.
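
The abstract does not say how algorithmic complexity is computed in practice. Since Kolmogorov complexity is uncomputable, a common proxy is the output size of a real compressor, combined into a normalized compression distance (NCD). The following is a minimal sketch under that assumption, not the authors' implementation: the choice of `zlib`, the function names `compressed_size` and `ncd`, and the toy byte strings standing in for whole-page representations are all illustrative.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate the algorithmic complexity of `data` by its zlib-compressed length."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # complexity of the two inputs taken together
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical stand-ins for whole-page representations (assumed data, not from the paper).
legit = "".join(
    f"<div class='row'>Account {i}: balance {i * 37 % 991}</div>\n" for i in range(300)
).encode()
# A near-copy of the legitimate page with one small alteration.
phish = legit.replace(b"Account 7:", b"Acc0unt 7:")
# An unrelated page with different markup and content.
other = "".join(
    f"<p id='n{i}'>{(i * i * 7919) % 100003} weather report item</p>\n" for i in range(300)
).encode()

if __name__ == "__main__":
    print("legit vs near-copy:", round(ncd(legit, phish), 3))
    print("legit vs unrelated:", round(ncd(legit, other), 3))
```

Each page is compressed as one byte string with no feature extraction, which matches the abstract's premise of treating a page as a single indivisible entity. One caveat of this sketch is that `zlib` only looks back 32 KB, so larger inputs would need a compressor with a bigger window for the distance to stay meaningful.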
