Algorithms for the Hard Pre-Image Problem of String Kernels and the General Problem of String Prediction

We address the pre-image problem encountered in structured output prediction, together with the related problem of finding a string that maximizes the prediction function of kernel-based classifiers and regressors. We show that, for many string kernels, both problems reduce to a common combinatorial problem. For this problem, we propose an upper bound on the prediction function that is cheap to compute and can be used in a branch and bound search to obtain optimal solutions. We also show that, for many string kernels, the problem becomes significantly harder when the kernel is normalized. On an optical word recognition task, solving the pre-image problem exactly significantly improves prediction accuracy over the approximation found by the best known heuristic. For the task of finding a string that maximizes the prediction function of kernel-based classifiers and regressors, we show that existing methods can be biased toward long strings containing many repeated symbols, and that this bias is removed when normalized kernels are used. Finally, we present results on identifying lead compounds for drug discovery. The source code can be found at https://github.com/a-ro/preimage.
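To illustrate the general idea of branch and bound search guided by an upper bound, here is a minimal sketch in Python. It is not the method from the paper or the code in the linked repository. It assumes a simplified, hypothetical prediction function: a fixed-length string scores the sum of position-specific 2-gram weights, with missing 2-grams scoring zero. The names `branch_and_bound`, `weights`, and `alphabet` are illustrative. The bound for a partial string is its score so far plus the best weight still attainable at each remaining position, which never underestimates the score of any completion.

```python
import heapq


def branch_and_bound(weights, alphabet):
    """Return (best_score, best_string) maximizing sum_i weights[i][y[i:i+2]].

    weights[i] maps a 2-gram to its score at position i (hypothetical toy
    model); 2-grams missing from weights[i] score 0. The string length is
    len(weights) + 1.
    """
    n = len(weights) + 1
    # Largest score attainable at each position, never below 0 because
    # a missing 2-gram scores 0.
    best_at = [max(max(w.values(), default=0.0), 0.0) for w in weights]
    # suffix_bound[i] = sum of best_at[i:], an optimistic score for the
    # 2-gram positions i, i+1, ... that a prefix has not yet fixed.
    suffix_bound = [0.0] * (len(weights) + 1)
    for i in range(len(weights) - 1, -1, -1):
        suffix_bound[i] = suffix_bound[i + 1] + best_at[i]

    # Max-heap on the upper bound (heapq is a min-heap, so negate).
    heap = [(-suffix_bound[0], "", 0.0)]
    while heap:
        neg_bound, prefix, score = heapq.heappop(heap)
        if len(prefix) == n:
            # For a complete string the bound equals the true score, and no
            # other node in the heap has a larger bound, so it is optimal.
            return score, prefix
        for symbol in alphabet:
            candidate = prefix + symbol
            gain = 0.0
            if prefix:
                gain = weights[len(prefix) - 1].get(prefix[-1] + symbol, 0.0)
            new_score = score + gain
            # Positions len(candidate) - 1 onward are still undecided.
            new_bound = new_score + suffix_bound[len(candidate) - 1]
            heapq.heappush(heap, (-new_bound, candidate, new_score))
    raise ValueError("alphabet must not be empty")


# Toy weights where the best 2-gram at each position is not mutually
# compatible, so picking per-position maxima independently would fail.
toy_weights = [
    {"ab": 2.0, "ba": 1.0},
    {"aa": 3.0, "bb": 1.0},
    {"ab": 1.0, "bc": 2.5},
]
best_score, best_string = branch_and_bound(toy_weights, "abc")
print(best_score, best_string)
```

Because nodes are expanded in decreasing order of their bound, the first complete string popped from the queue is optimal for this toy scoring function. A tighter bound prunes more of the search tree, which is why the quality and cost of the bound matter in practice.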
