An Efficient Similarity Measure for Clustering of Categorical Sequences

In this paper, we propose an efficient similarity measure to be used as a pre-processing step for clustering data with categorical and sequential attributes. The measure is based on a new dynamic programming algorithm that computes a sequence comparison score from a gap penalty matrix; the similarity value is then obtained by normalizing this score. We evaluate the proposed measure through clustering experiments, since clustering is an unsupervised learning method whose results depend heavily on the similarity measure used. The experiments use Tcpdump data from the DARPA 1999 Intrusion Detection Evaluation Data Sets, which consist of sequential network packet data. Finally, we discuss the results of the comparison experiments.
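
As an illustration only, the following Python sketch shows one way a normalized, dynamic-programming sequence comparison score could be computed for categorical sequences. It is not the algorithm proposed in this paper: it substitutes a standard Needleman-Wunsch-style global alignment with a single constant gap penalty for the gap penalty matrix. The function names `alignment_score` and `similarity`, the scoring parameters, the clipping of negative scores to zero, and the example tokens are all assumptions made for the sketch.

```python
from typing import Hashable, Sequence


def alignment_score(
    a: Sequence[Hashable],
    b: Sequence[Hashable],
    match: float = 1.0,
    mismatch: float = -0.5,
    gap: float = -0.5,
) -> float:
    """Global alignment score of two categorical sequences.

    Assumption: a constant gap penalty stands in for the paper's
    gap penalty matrix, whose definition is not given in the abstract.
    """
    n, m = len(a), len(b)
    # Row 0: aligning an empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: aligning a[:i] against an empty prefix costs i gaps.
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
        prev = curr
    return prev[m]


def similarity(
    a: Sequence[Hashable],
    b: Sequence[Hashable],
    match: float = 1.0,
    mismatch: float = -0.5,
    gap: float = -0.5,
) -> float:
    """Normalize the alignment score to the range [0, 1].

    Assumption: the score is divided by the best score attainable for the
    longer sequence, and negative scores are clipped to 0. Requires match > 0.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    score = alignment_score(a, b, match, mismatch, gap)
    return max(0.0, score) / (match * longest)


# Hypothetical packet-flag sequences, used only to exercise the functions.
s1 = ["SYN", "ACK", "PSH", "FIN"]
s2 = ["SYN", "ACK", "FIN"]
print(similarity(s1, s2))
```

Under these assumptions, identical sequences score 1 and the measure is symmetric, so the pairwise values could populate a similarity matrix handed to a clustering algorithm.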