Paper information - Remote homology detection: a motif based approach

Remote homology detection: a motif based approach

MOTIVATION: Remote homology detection is the problem of detecting homology in cases of low sequence similarity. It is a hard computational problem, and no single approach works well in all cases.

RESULTS: We present a method for detecting remote homology that is based on the presence of discrete sequence motifs. The motif content of a pair of sequences defines a similarity measure that serves as the kernel for a Support Vector Machine (SVM) classifier. We test the method on two remote homology detection tasks: prediction of a previously unseen SCOP family, and prediction of an enzyme class given other enzymes that have a similar function on other substrates. We find that it performs significantly better than an SVM method that uses BLAST or Smith-Waterman similarity scores as features.
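The idea of turning shared motif content into an SVM kernel can be illustrated with a minimal sketch. It assumes that motifs are given as regular expressions and that the similarity is the dot product of per-sequence motif counts. The abstract does not spell out these details, so the motif list, the function names `motif_counts` and `motif_kernel`, and the counting scheme are all hypothetical and should not be read as the authors' implementation.

```python
import re
from typing import List, Sequence

# Hypothetical motifs, written as regular expressions over amino-acid letters.
# These are illustrative only and are not taken from the paper.
MOTIFS: List[str] = [r"C..C", r"G[KR]S", r"H.H"]


def motif_counts(seq: str, motifs: Sequence[str] = MOTIFS) -> List[int]:
    """Return, for each motif, the number of (possibly overlapping) matches in seq."""
    counts = []
    for motif in motifs:
        # A zero-width lookahead lets overlapping occurrences be counted separately.
        pattern = re.compile(f"(?=(?:{motif}))")
        counts.append(sum(1 for _ in pattern.finditer(seq)))
    return counts


def motif_kernel(x: str, y: str, motifs: Sequence[str] = MOTIFS) -> int:
    """Assumed similarity: dot product of the two motif-count vectors."""
    cx = motif_counts(x, motifs)
    cy = motif_counts(y, motifs)
    return sum(a * b for a, b in zip(cx, cy))


def gram_matrix(seqs: Sequence[str], motifs: Sequence[str] = MOTIFS) -> List[List[int]]:
    """Pairwise kernel values, in the form a precomputed-kernel SVM would consume."""
    return [[motif_kernel(a, b, motifs) for b in seqs] for a in seqs]
```

Because this similarity is an inner product of explicit feature vectors, it is symmetric and positive semidefinite, which is what allows it to act as an SVM kernel. Under these assumptions, the matrix returned by `gram_matrix` could be handed to any SVM implementation that accepts a precomputed kernel.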

Douglas L. Brutlag | Asa Ben-Hur
