Sequence analysis

More challenges for machine-learning protein interactions

Motivation: Machine learning may be the most popular computational tool in molecular biology. Providing sustained performance estimates is challenging. The standard cross-validation protocols usually fail in biology. Park and Marcotte found that even refined protocols fail for protein–protein interactions (PPIs).

Results: Here, we sketch additional problems for the prediction of PPIs from sequence alone. First, it matters not only whether protein A or B of a target interaction A–B is similar to proteins of training interactions (positives), but also whether A or B is similar to proteins of non-interactions (negatives). Second, training on multiple interaction partners per protein did not improve performance for new proteins (not used to train). On the contrary, a strictly non-redundant training that ignored good data slightly improved the prediction of difficult cases. Third, which prediction method appears to be best depends crucially on the sequence similarity between the test and the training set, on how many true interactions should be found, and on the expected ratio of negatives to positives. The correct assessment of performance is the most complicated task in the development of prediction methods. Our analyses suggest that PPIs square the challenge for this task.

Availability and implementation: Datasets used in our analyses are available at https://rostlab.org/owiki/index.php/PPI_challenges

Contact: rost@in.tum.de

Supplementary information: Supplementary data are available at Bioinformatics online.
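The third point of the Results, that the apparent quality of a method depends on the expected ratio of negatives to positives, can be illustrated with a small calculation. The sketch below is not taken from the paper: the function name `expected_precision` and the example rates (50% true positive rate, 1% false positive rate) are hypothetical values chosen for illustration. It relies only on the standard identity precision = TPR / (TPR + FPR × r), where r is the number of negatives per positive.

```python
def expected_precision(tpr: float, fpr: float, neg_per_pos: float) -> float:
    """Precision implied by a true/false positive rate at a given class ratio.

    tpr         -- fraction of true interactions that are predicted (recall)
    fpr         -- fraction of non-interactions wrongly predicted as interactions
    neg_per_pos -- assumed number of non-interacting pairs per interacting pair
    """
    if not (0.0 <= tpr <= 1.0 and 0.0 <= fpr <= 1.0):
        raise ValueError("tpr and fpr must lie in [0, 1]")
    if neg_per_pos < 0:
        raise ValueError("neg_per_pos must be non-negative")
    # Per positive pair: tpr true positives and fpr * neg_per_pos false positives.
    predicted = tpr + fpr * neg_per_pos
    return tpr / predicted if predicted > 0 else 0.0


# Hypothetical method: finds half of all interactions, mislabels 1% of negatives.
for ratio in (1, 10, 100, 1000):
    print(f"negatives per positive = {ratio:>4}: "
          f"precision = {expected_precision(0.5, 0.01, ratio):.3f}")
```

With the same hypothetical rates, precision is high on a balanced test set and drops sharply as negatives come to dominate, so a ranking of methods obtained at one ratio need not hold at another.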

[1] P. Uetz, et al. The binary protein-protein interaction landscape of Escherichia coli, 2014, Nature Biotechnology.

[2] E. Marcotte, et al. A flaw in the typical evaluation scheme for pair-input computational predictions, 2012, Nature Methods.

[3] Martin H. Schaefer, et al. HIPPIE: Integrating Protein Interaction Networks with Experiment Based Quality Scores, 2012, PloS one.

[4] Javier Herrero, et al. Toward community standards in the quest for orthologs, 2012, Bioinform.

[5] Nicolas Thierry-Mieg, et al. New insights into protein-protein interaction data lead to increased estimates of the S. cerevisiae interactome size, 2010, BMC Bioinformatics.

[6] Ian H. Witten, et al. The WEKA data mining software: an update, 2009, SKDD.

[7] A. Barabasi, et al. An empirical framework for binary interactome mapping, 2008, Nature Methods.

[8] J. R. Green, et al. Global investigation of protein–protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences, 2008, Nucleic acids research.

[9] Hyeong Jun An, et al. Estimating the size of the human interactome, 2008, Proceedings of the National Academy of Sciences.

[10] Yanzhi Guo, et al. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences, 2008, Nucleic acids research.

[11] Burkhard Rost, et al. Protein–Protein Interactions More Conserved within Species than across Species, 2006, PLoS Comput. Biol.

[12] Jean-Loup Faulon, et al. Predicting protein-protein interactions using signature products, 2005, Bioinform.

[13] Adam J. Smith, et al. The Database of Interacting Proteins: 2004 update, 2004, Nucleic Acids Res.

[14] Burkhard Rost, et al. UniqueProt: creating representative protein sequence sets, 2003, Nucleic Acids Res.

[15] B. Rost. Enzyme function less conserved than anticipated, 2002, Journal of molecular biology.

[16] B. Rost. Twilight zone of protein sequence alignments, 1999, Protein engineering.

[17] B. Rost, et al. Prediction of protein secondary structure at better than 70% accuracy, 1993, Journal of molecular biology.

[18] C. Sander, et al. Database of homology‐derived protein structures and the structural meaning of sequence alignment, 1991, Proteins.