Large scale genomic sequence SVM classifiers

In genomic sequence analysis tasks like splice site recognition or promoter identification, large amounts of training sequences are available, and indeed needed to achieve sufficiently high classification performances. In this work we study two recently proposed and successfully used kernels, namely the Spectrum kernel and the Weighted Degree kernel (WD). In particular, we suggest several extensions using Suffix Trees and modifications of an SMO-like SVM training algorithm in order to accelerate the training of the SVMs and their evaluation on test sequences. Our simulations show that for the spectrum kernel and WD kernel, large scale SVM training can be accelerated by factors of 20 and 4 times, respectively, while using much less memory (e.g. no kernel caching). The evaluation on new sequences is often several thousand times faster using the new techniques (depending on the number of Support Vectors). Our method allows us to train on sets as large as one million sequences.

[1]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[2]  M. V. Rossum,et al.  In Neural Computation , 2022 .

[3]  Bernhard Schölkopf,et al.  Inexact Matching String Kernels for Protein Classification , 2004 .

[4]  Gunnar Rätsch,et al.  An introduction to kernel-based learning algorithms , 2001, IEEE Trans. Neural Networks.

[5]  Rainer Merkl,et al.  Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites , 2004, BMC Bioinformatics.

[6]  Kimberly Van Auken,et al.  WormBase: a multi-species resource for nematode biology and genomics , 2004, Nucleic Acids Res..

[7]  Bernhard Schölkopf,et al.  Kernel Methods in Computational Biology , 2005 .

[8]  B. Schölkopf,et al.  Accurate Splice Site Detection for Caenorhabditis elegans , 2004 .

[9]  Gunnar Rätsch,et al.  Learning Interpretable SVMs for Biological Sequence Classification , 2005, BMC Bioinformatics.

[10]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[11]  Alexander J. Smola,et al.  Fast Kernels for String and Tree Matching , 2002, NIPS.

[12]  Nello Cristianini,et al.  An introduction to Support Vector Machines , 2000 .

[13]  Gregory D. Schuler,et al.  Database resources of the National Center for Biotechnology , 2003, Nucleic Acids Res..

[14]  C. Metz Basic principles of ROC analysis. , 1978, Seminars in nuclear medicine.

[15]  Alexander J. Smola,et al.  Learning with kernels , 1998 .

[16]  Gunnar Rätsch,et al.  A New Discriminative Kernel from Probabilistic Models , 2001, Neural Computation.

[18]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[19]  Gunnar Rätsch,et al.  Engineering Support Vector Machine Kerneis That Recognize Translation Initialion Sites , 2000, German Conference on Bioinformatics.

[20]  Jean-Philippe Vert,et al.  Local Alignment Kernels for Biological Sequences , 2004 .

[21]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[22]  M. Boguski,et al.  dbEST — database for “expressed sequence tags” , 1993, Nature Genetics.

[23]  Gunnar Rätsch,et al.  RASE: recognition of alternatively spliced exons in C.elegans , 2005, ISMB.

[24]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[25]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[26]  David Haussler,et al.  Using the Fisher Kernel Method to Detect Remote Protein Homologies , 1999, ISMB.

[27]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[28]  Li Liao,et al.  Combining pairwise sequence similarity and support vector machines for remote protein homology detection , 2002, RECOMB '02.

[29]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[30]  Bernhard Schölkopf,et al.  Support vector learning , 1997 .