Paper information - An investigation of implicit features in compression-based learning for comparing webpages


We investigate compression-based learning for image classification tasks. These algorithms are claimed to approximate the Kolmogorov complexity of the difference between two object descriptions, but in practice they compute a measure over an induced feature space. We investigate whether these algorithms can be improved via feature selection. Our experiments cover a corpus of legitimate websites and phishing websites impersonating them; the task is to classify a webpage as either legitimate or a phish. We perform feature selection in the feature space induced by a well-known compression algorithm (specifically, the entries of the compression dictionary). We then apply four well-known classification algorithms to the reduced feature sets and conduct a receiver operating characteristic (ROC) analysis on them. We find that a subset of the features is sufficient for near-perfect classification of these webpages.
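To make the two views in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes zlib as a stand-in compressor for a normalized compression distance, and an LZ78-style parse as the dictionary-building algorithm. The abstract names neither, so both are assumptions. The function names `ncd`, `lz78_phrases`, and `dictionary_overlap` are hypothetical and chosen for illustration only.

```python
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with zlib as an assumed compressor.

    Smaller values mean the two inputs share more compressible structure.
    """
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def lz78_phrases(data: bytes) -> set:
    """Return the phrases an LZ78-style parse adds to its dictionary.

    Each new phrase is the longest already-known phrase plus one more byte.
    A trailing phrase that is already in the dictionary is not added again.
    """
    phrases = set()
    current = b""
    for i in range(len(data)):
        current += data[i:i + 1]
        if current not in phrases:
            phrases.add(current)
            current = b""
    return phrases


def dictionary_overlap(x: bytes, y: bytes) -> float:
    """Jaccard similarity of the two phrase sets (the implicit features)."""
    px, py = lz78_phrases(x), lz78_phrases(y)
    union = px | py
    return len(px & py) / len(union) if union else 1.0


if __name__ == "__main__":
    page = b"<html><body><h1>Example Bank</h1><p>Sign in</p></body></html>" * 20
    other = bytes(range(256)) * 6
    print("NCD(page, page): ", ncd(page, page))
    print("NCD(page, other):", ncd(page, other))
    print("overlap(page, other):", dictionary_overlap(page, other))
```

Under these assumptions, each phrase in the dictionary can be treated as one binary feature of a page. Feature selection then amounts to keeping only the phrases that best separate the two classes before a classifier is trained on the reduced set.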

Scott Dick | James Miller | Teh-Chung Chen | Torin Stepan
