The Treasury Chest of Text Mining: Piling Available Resources for Powerful Biomedical Text Mining

Text mining (TM) is a semi-automated, multi-step process that turns unstructured data into structured data. Its relevance has grown as machine learning (ML) and deep learning (DL) algorithms have been applied to its various steps. When applied to biomedical literature, text mining is called biomedical text mining, and its specificity lies in both the type of documents analyzed and the language and concepts retrieved. The documents that can be used range from scientific literature to patents and clinical data, and the biomedical concepts often include, but are not limited to, genes, proteins, drugs, and diseases. This review gathers the leading tools for biomedical TM, briefly describing and systematizing them. We also surveyed several resources to compile the most valuable ones for each category.
