Support Vector Machine Prediction of Signal Peptide Cleavage Site Using a New Class of Kernels for Strings

A new class of kernels for strings is introduced. These kernels can be used by any kernel-based data analysis method, including support vector machines (SVM). They are derived from probabilistic models to integrate biologically relevant information. We show how to compute the kernels corresponding to several classical probabilistic models, and illustrate their use by building a SVM for the problem of predicting the cleavage site of signal peptides from the amino-acid sequence of a protein. At a given rate of false positive this method retrieves up to 47% more true positives than the classical weight matrix method.