Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage

MOTIVATION: Discovery of host and pathogen genes expressed at the plant-pathogen interface often requires the construction of mixed libraries that contain sequences from both genomes. Sequence identification requires high-throughput and reliable classification of genome origin. When using single-pass cDNA sequences, difficulties arise from the short sequence length, the lack of sufficient taxonomically relevant sequence data in public databases, and ambiguous sequence homology between plant and pathogen genes.

RESULTS: A novel method is described that is independent of the availability of homologous genes and relies on subtle differences in codon usage between plant and fungal genes. We used support vector machines (SVMs) to identify the probable origin of sequences. SVMs were compared to several other machine learning techniques and to PF-IND, a probabilistic algorithm for expressed sequence tag (EST) classification that is also based on codon bias differences. Our software (Eclat) achieved a classification accuracy of 93.1% on a test set of 3217 EST sequences from Hordeum vulgare and Blumeria graminis, a significant improvement over PF-IND (prediction accuracy of 81.2% on the same test set). EST sequences with at least 50 nt of coding sequence can be classified using Eclat with high confidence. Eclat allows training of classifiers for any host-pathogen combination for which there are sufficient classified training sequences.

AVAILABILITY: Eclat is freely available on the Internet (http://mips.gsf.de/proj/est) or on request as a standalone version.

CONTACT: friedel@informatik.uni-muenchen.de.
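The codon-usage idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how Eclat encodes sequences, so the 64-dimensional relative-frequency vector below, the function name `codon_usage_vector`, and the `frame` parameter are all assumptions made for illustration, not the authors' implementation. Vectors of this kind could then be passed to any SVM library for training; that step is not shown here.

```python
from itertools import product

# All 64 codons in a fixed order, so every sequence maps to the same feature layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_usage_vector(coding_seq, frame=0):
    """Return relative frequencies of the 64 codons in coding_seq.

    Hypothetical helper: the reading frame is assumed to be known, and the
    encoding is an illustrative choice rather than the one used by Eclat.
    """
    seq = coding_seq.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases such as N
            counts[codon] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


# Toy coding fragment (made up for illustration, not real EST data).
features = codon_usage_vector("ATGGCCGCCNNNGGATAA")
```

Short ESTs yield few codons, so the resulting frequencies are noisy. This is consistent with the abstract's remark that a minimum amount of coding sequence is needed for confident classification.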
