ANDDigest: a new web-based module of ANDSystem for the search of knowledge in the scientific literature

Background The rapid growth of the scientific literature has made finding relevant information a critical problem in almost every field of research. Search engines such as Google Scholar, Web of Knowledge, PubMed, and Scopus, among others, are highly effective at document search, but they do not support knowledge extraction. Text-mining systems, in contrast, extract knowledge and represent it as semantic networks. Of particular interest are tools that perform the full cycle of knowledge management and engineering: automated retrieval, integration, and representation of knowledge as semantic networks, followed by their visualization and analysis. STRING, Pathway Studio, and MetaCore are well-known examples of such products. We previously developed the Associative Network Discovery System (ANDSystem), which also implements such a cycle. A drawback of these systems, however, is their dependence on the ontologies used to describe the subject area, which limits their ability to search for information based on user-specified queries.

Results ANDDigest is a new web-based module of ANDSystem that searches PubMed using both the ANDSystem dictionaries and sets of user-defined keywords. It supports complex queries that take into account many types of objects from the ANDSystem ontology simultaneously. The system has a user-friendly interface for sorting, visualizing, and filtering the retrieved information, including mapping of the objects mentioned in the text, links to external databases, and sorting by publication date, citation count, journal H-index, and other criteria. It also reports trends for the identified entities, reflecting the dynamics of interest in them as measured by the frequency of their mentions in PubMed by year.

Conclusions The main feature of ANDDigest is specialized search for information on multiple associative relationships between objects from the ANDSystem ontology vocabularies, combined with user-specified keywords. The tool can be applied to the interpretation of experimental genetics data, the search for associations between molecular genetic objects, and the preparation of scientific and analytical reviews. It is currently available at https://anddigest.sysbio.ru/.
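
As an illustration only, the kind of query described above (dictionary objects combined with free keywords, plus a per-year mention trend) can be sketched in a few lines of Python. This is not the ANDDigest implementation: the toy records, the synonym dictionary, the naive substring matching, and the function names `search` and `yearly_share` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus: (publication year, abstract text) pairs.
RECORDS = [
    (2018, "IL10 knockdown alters asthma parameters in hypertensive rats."),
    (2019, "Asthma and hypertension comorbidity genes include IL10."),
    (2019, "Hypoxia-inducible factor and oxygen sensing."),
    (2020, "IL10 expression in asthma patients under hypoxia."),
]

# Hypothetical dictionary mapping an object name to lowercase synonyms.
DICTIONARY = {
    "IL10": ["il10", "il-10", "interleukin 10"],
    "asthma": ["asthma"],
}


def mentions(text, synonyms):
    """Naive check (assumption): any synonym occurs as a substring."""
    lowered = text.lower()
    return any(s in lowered for s in synonyms)


def search(records, objects, keywords=()):
    """Keep records mentioning every dictionary object and every keyword."""
    hits = []
    for year, text in records:
        has_objects = all(mentions(text, DICTIONARY[o]) for o in objects)
        has_keywords = all(k.lower() in text.lower() for k in keywords)
        if has_objects and has_keywords:
            hits.append((year, text))
    return hits


def yearly_share(records, obj):
    """Fraction of each year's records that mention the given object."""
    total = Counter(year for year, _ in records)
    hit = Counter(
        year for year, text in records if mentions(text, DICTIONARY[obj])
    )
    return {year: hit[year] / total[year] for year in sorted(total)}


if __name__ == "__main__":
    for year, text in search(RECORDS, ["IL10", "asthma"], keywords=["hypoxia"]):
        print(year, text)
    print(yearly_share(RECORDS, "IL10"))
```

Normalizing by the number of records per year, rather than using raw counts, is a design choice of this sketch: it keeps the overall growth of the literature from being mistaken for growing interest in a particular object.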
