Text as data: using text-based features for protein representation and for computational prediction of protein characteristics.

The current era of large-scale biology is characterized by rapid growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined. At the same time, known and postulated information about genes and proteins accumulates in the published scientific literature, which is expanding at a rate of over a million new publications per year. Computational tools that attempt to automatically predict and annotate protein characteristics, such as function and localization patterns, are being developed, along with systems that aim to support this process through text mining. Most work on protein characterization focuses on features derived directly from protein sequence data. Protein-related work that does use the literature typically concentrates on extracting specific facts (e.g., protein interactions) from text. In the past few years we have taken a different route, treating the literature as a source of text-based features that can be employed, just as sequence-based protein features were in earlier work, to predict protein subcellular location and possibly also function. Here we discuss the overall approach in detail, along with results from our work in this area that demonstrate the value of the method and its potential use.
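To make the idea of text-based features concrete, the following minimal sketch represents each protein by a term-frequency vector pooled from abstracts associated with it, and assigns a location to a query protein by cosine similarity to labeled examples. It is an illustration under stated assumptions, not the classifiers used in the work described here: the function names (`term_vector`, `cosine`, `predict_location`), the 1-nearest-neighbour rule, and the two toy abstracts are all hypothetical and chosen only to keep the example self-contained.

```python
import math
import re
from collections import Counter


def term_vector(abstracts):
    """Pool the abstracts linked to one protein into a term-frequency vector."""
    tokens = []
    for text in abstracts:
        # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer).
        tokens.extend(re.findall(r"[a-z]+", text.lower()))
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_location(query_abstracts, labeled):
    """Return the location of the most similar labeled protein (1-nearest neighbour).

    `labeled` is a list of (abstracts, location) pairs.
    """
    q = term_vector(query_abstracts)
    best = max(labeled, key=lambda item: cosine(q, term_vector(item[0])))
    return best[1]


# Hypothetical toy data: invented sentences, not real annotations.
labeled = [
    (["The protein binds chromatin and regulates transcription in the nucleus."],
     "nucleus"),
    (["This enzyme participates in oxidative phosphorylation in the mitochondrial membrane."],
     "mitochondrion"),
]
query = ["A transcription factor that binds chromatin."]

predicted = predict_location(query, labeled)
```

In this sketch the text vector plays the same role a sequence-derived feature vector (for example, amino acid composition) would play, so any standard classifier could be substituted for the nearest-neighbour rule.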
