External cross‐validation for unbiased evaluation of protein family detectors: Application to allergens

Key issues in protein science and computational biology are the design and evaluation of algorithms that detect proteins belonging to a specific family, as defined by structural, evolutionary, or functional criteria. In this context, validation techniques are often used to compare different parameter settings of the detector and then to select the setting that yields the smallest error rate estimate. A frequently overlooked problem with this approach is that the smallest error rate estimate may have a large optimistic bias, because the same examples are used both to select the setting and to estimate its error. Based on computer simulations, we show that a detector's error rate estimate can be overly optimistic, and we propose a method to obtain unbiased performance estimates of a detector design procedure. The method is founded on an external 10‐fold cross‐validation (CV) loop that embeds the internal validation procedure used for parameter selection in detector design. The detector designed in each of the 10 iterations is evaluated on held‐out examples that are available only in the external CV loop. Notably, the average of these 10 performance estimates is not associated with a final detector, but rather with the average performance of the design procedure used. We apply the external CV loop to the particular problem of detecting potentially allergenic proteins, using a previously reported design procedure. Unbiased performance estimates of the allergen detector design procedure are presented together with information about which algorithms and parameter settings are most frequently selected. Proteins 2005. © 2005 Wiley‐Liss, Inc.
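
The external CV loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the k‐nearest‐neighbour detector, the 5‐fold internal CV, the candidate values of k, and all function names are assumptions chosen only to keep the example self‐contained. The point it shows is structural: parameter selection happens entirely inside each external iteration, and the held‐out external fold is used only to score the detector that the design procedure produced.

```python
import random
from statistics import mean


def make_folds(n, n_folds, rng):
    """Shuffle indices 0..n-1 and deal them into n_folds disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def split(data, fold):
    """Return (training examples, held-out examples) for one fold."""
    held = set(fold)
    train = [d for i, d in enumerate(data) if i not in held]
    test = [data[i] for i in fold]
    return train, test


def knn_predict(train, x, k):
    """Hypothetical stand-in detector: majority vote among the k nearest points."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def error_rate(train, test, k):
    """Fraction of held-out examples that the k-NN detector misclassifies."""
    return sum(knn_predict(train, x, k) != y for x, y in test) / len(test)


def design(data, candidates, rng, n_inner=5):
    """Design procedure: internal CV picks the setting with the smallest error.

    Returns the selected k and its internal error estimate, which is the
    quantity that may be optimistically biased.
    """
    folds = make_folds(len(data), n_inner, rng)
    scores = {
        k: mean(error_rate(*split(data, fold), k) for fold in folds)
        for k in candidates
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


def external_cv(data, candidates, n_outer=10, seed=0):
    """External CV loop wrapped around the whole design procedure."""
    rng = random.Random(seed)
    outer_errs, inner_errs, chosen = [], [], []
    for fold in make_folds(len(data), n_outer, rng):
        design_set, held_out = split(data, fold)
        # Selection sees only the design set, never the held-out fold.
        best_k, inner_err = design(design_set, candidates, rng)
        outer_errs.append(error_rate(design_set, held_out, best_k))
        inner_errs.append(inner_err)
        chosen.append(best_k)
    return mean(outer_errs), mean(inner_errs), chosen


if __name__ == "__main__":
    # Pure-noise labels: no detector can truly beat a 50% error rate here.
    rng = random.Random(1)
    data = [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(60)]
    outer, inner, chosen = external_cv(data, [1, 3, 5, 7, 9])
    print("external estimate of the design procedure:", outer)
    print("mean smallest internal estimate:", inner)
    print("selected k per external iteration:", chosen)
```

On labels that carry no signal, the mean smallest internal estimate tends to fall below the external estimate, which is the optimistic bias the abstract refers to; the gap in any single run depends on the seed and the data. The list of selected settings corresponds to the abstract's report of which settings are most frequently chosen, and the external average describes the design procedure rather than any one final detector.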
