Efficient Algorithms for Similarity Measures over Sequential Data: A Look Beyond Kernels

Kernel functions as similarity measures for sequential data have been extensively studied in previous research. This contribution addresses the efficient computation of distance functions and similarity coefficients for sequential data. Two proposed algorithms utilize different data structures for efficient computation and yield a runtime linear in the sequence length. Experiments on network data for intrusion detection suggest the importance of distances and even non-metric similarity measures for sequential data.
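To make the idea concrete, here is a minimal sketch of one distance function and one similarity coefficient computed over sequences. It rests on assumptions the abstract does not state: sequences are embedded as n-gram counts, a hash map (Python's `Counter`) serves as the data structure, and the Manhattan distance and Jaccard coefficient are the example measures. The names `ngram_counts`, `manhattan_distance`, and `jaccard_coefficient` are hypothetical and are not the paper's algorithms. With hashing, each function runs in expected time linear in the combined sequence length for a fixed n.

```python
from collections import Counter


def ngram_counts(seq, n):
    """Count every contiguous n-gram of seq (assumed embedding, not from the paper)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def manhattan_distance(x, y, n=3):
    """Sum of absolute differences between the n-gram counts of x and y."""
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    # Only n-grams present in at least one sequence can contribute.
    return sum(abs(cx[w] - cy[w]) for w in cx.keys() | cy.keys())


def jaccard_coefficient(x, y, n=3):
    """Shared n-grams divided by all distinct n-grams (binary presence)."""
    sx, sy = set(ngram_counts(x, n)), set(ngram_counts(y, n))
    union = sx | sy
    if not union:
        # Both sequences are shorter than n; treat them as identical (a convention).
        return 1.0
    return len(sx & sy) / len(union)
```

For example, `manhattan_distance("abab", "baba", n=2)` compares the counts {ab: 2, ba: 1} with {ab: 1, ba: 2}, and `jaccard_coefficient("abc", "abd", n=2)` compares the sets {ab, bc} and {ab, bd}.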
