Automatic generation of bioinformatics tools for predicting protein–ligand binding sites

MOTIVATION: Predictive tools that can model protein-ligand binding on demand are needed to support ligand research in innovative drug design. However, developing a predictive tool for each individual ligand takes considerable time and effort. An automated production pipeline that rapidly and efficiently builds user-friendly protein-ligand binding prediction tools would therefore be useful.

RESULTS: We developed a system that automatically generates protein-ligand binding prediction tools. Implemented as a pipeline of web tools based on Semantic Web techniques, the system allows users to specify a ligand and receive the corresponding tool within 0.5-1 day. We demonstrated high prediction accuracy for three machine learning algorithms and eight ligands.

AVAILABILITY AND IMPLEMENTATION: The source code and web application are freely available for download at http://utprot.net. They are implemented in Python and supported on Linux.

CONTACT: shimizu@bi.a.u-tokyo.ac.jp

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
