Text Mining Improves Prediction of Protein Functional Sites

We present LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites), an approach that integrates protein structure analysis and text mining to predict protein functional sites. The structure analysis uses Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts residue mentions from the literature and predicts that the mentioned residues are functionally important. We assessed the significance of each method by analyzing how well it recovers known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant over those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: residues mentioned in text were roughly six times more likely than other residues to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that DPA and text mining each independently provide valuable high-throughput predictions of protein functional sites, and that integrating the two methods in LEAP-FS further improves the quality of these predictions.
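As a rough illustration of the ideas above, the following Python sketch extracts residue mentions from text, intersects them with structure-based predictions, and computes how much more often mentioned residues fall in annotated functional sites than other residues. It is not the LEAP-FS implementation: the regular expression, the function names, and the toy inputs are all assumptions made for this example, and the pattern handles only simple three-letter mentions such as "His57", "Asp-102", or "Ser 195".

```python
import re

# Standard three-letter amino acid codes.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Hypothetical pattern (an assumption, not the paper's extractor):
# a three-letter code, an optional space or hyphen, then a position.
RESIDUE_RE = re.compile(r"\b(" + AA3 + r")[ -]?(\d+)\b", re.IGNORECASE)


def extract_residue_mentions(text):
    """Return the set of (residue code, position) pairs found in text."""
    return {(m.group(1).capitalize(), int(m.group(2)))
            for m in RESIDUE_RE.finditer(text)}


def agreed_positions(text_positions, structure_positions):
    """Positions predicted by both the text-based and structure-based method."""
    return set(text_positions) & set(structure_positions)


def enrichment(mentioned, annotated, all_positions):
    """Ratio of functional-site rates: mentioned residues vs. all others.

    Returns None when either group is empty, and infinity when no
    unmentioned residue is annotated.
    """
    all_positions = set(all_positions)
    annotated = set(annotated)
    mentioned = set(mentioned) & all_positions
    others = all_positions - mentioned
    if not mentioned or not others:
        return None
    rate_mentioned = len(mentioned & annotated) / len(mentioned)
    rate_others = len(others & annotated) / len(others)
    return rate_mentioned / rate_others if rate_others else float("inf")


if __name__ == "__main__":
    sentence = "The catalytic triad of His57, Asp-102 and Ser 195 is conserved."
    mentions = extract_residue_mentions(sentence)
    print(sorted(mentions, key=lambda pair: pair[1]))

    text_positions = {pos for _, pos in mentions}
    structure_positions = {57, 102, 189}  # made-up structure-based predictions
    print(sorted(agreed_positions(text_positions, structure_positions)))
```

A pattern this simple will also match ordinary prose such as "his 5", so a realistic extractor would need additional filtering and a mapping from mentioned positions to the numbering of the structure being analyzed.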
