String Kernels Based on Variable-Length-Don't-Care Patterns

We propose a new string kernel based on variable-length-don't-care patterns (VLDC patterns). A VLDC pattern is an element of (Σ ∪ {⋆})*, where Σ is an alphabet and ⋆ is the variable-length-don't-care symbol that matches any string in Σ*. The number of VLDC patterns matching a given string s of length n is O(2^{2n}). We present an O(n^5) algorithm for computing the kernel value. We also propose variations of the kernel which modify the relative weights of each pattern. We evaluate our kernels using a support vector machine to classify spam data.
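
To make the matching rule concrete, the following is a minimal sketch in Python that checks whether a single VLDC pattern matches a string. It is an illustration only, not the O(n^5) kernel algorithm. The function name `vldc_matches`, the constant `STAR`, and the use of regular expressions are assumptions introduced here. The sketch also assumes that a pattern must account for the whole string, with each don't-care symbol replaced by some (possibly empty) string.

```python
import re

# Hypothetical stand-in for the variable-length-don't-care symbol.
STAR = "⋆"


def vldc_matches(pattern: str, s: str) -> bool:
    """Return True if `pattern` matches the entire string `s`.

    Each STAR may be replaced by any (possibly empty) string;
    every other character must match itself literally.
    """
    # Translate the pattern into a regular expression:
    # STAR becomes ".*", everything else is escaped as a literal.
    regex = "".join(".*" if c == STAR else re.escape(c) for c in pattern)
    # DOTALL lets ".*" also span newline characters.
    return re.fullmatch(regex, s, flags=re.DOTALL) is not None
```

For example, under these assumptions `vldc_matches("a⋆b", "axxb")` holds because the don't-care symbol can absorb `xx`, while `vldc_matches("a⋆b", "abx")` does not, since the trailing `x` is left unmatched.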