An investigation of implicit features in compression-based learning for comparing webpages

We investigate compression-based learning for image classification tasks. These algorithms are claimed to approximate the Kolmogorov complexity of the difference between two object descriptions, but in practice they compute a measure over an induced feature space. We investigate whether these algorithms can be improved via feature selection. Our experiments cover a corpus of legitimate websites and phishing websites impersonating them; the task is to classify a webpage as either legitimate or a phish. We perform feature selection in the feature space induced by a well-known compression algorithm (specifically, the entries of the compression dictionary). We then apply four well-known classification algorithms to the reduced feature sets and conduct a Receiver Operating Characteristic (ROC) analysis of the resulting classifiers. We find that a subset of the features is sufficient for near-perfect classification of these webpages.
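To make the idea of a compressor-induced feature space concrete, the following is a minimal sketch, not the authors' implementation. It assumes an LZ78-style parser (the abstract does not name the compression algorithm) and treats each phrase added to the dictionary as one feature. The function names `lz78_dictionary` and `dictionary_overlap` are hypothetical, and the Jaccard overlap is only an illustrative way to compare two dictionaries.

```python
def lz78_dictionary(data: bytes) -> set[bytes]:
    """Return the set of phrases an LZ78-style parse adds to its dictionary.

    Assumption: each dictionary entry is treated as one implicit feature.
    """
    phrases: set[bytes] = set()
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in phrases:
            # Known phrase: keep extending it with the next byte.
            current = candidate
        else:
            # Unseen phrase: it becomes a new dictionary entry.
            phrases.add(candidate)
            current = b""
    return phrases


def dictionary_overlap(a: bytes, b: bytes) -> float:
    """Jaccard overlap of the two induced dictionaries (illustrative only)."""
    da, db = lz78_dictionary(a), lz78_dictionary(b)
    union = da | db
    # Two empty inputs are treated as identical.
    return len(da & db) / len(union) if union else 1.0
```

Under these assumptions, feature selection would amount to keeping only a subset of the dictionary entries before handing the resulting feature vectors to a classifier.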
