BioTextQuest+: a knowledge integration platform for literature mining and concept discovery

SUMMARY The iterative process of finding relevant information in biomedical literature and performing bioinformatics analyses can become an endless loop for an inexperienced user, given the exponential growth of scientific corpora and the plethora of tools designed to mine PubMed® and related biological databases. Herein, we describe BioTextQuest+, a web-based interactive knowledge exploration platform with significant advances over its predecessor (BioTextQuest), which aims to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest+ enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents. The front-end interface facilitates the browsing of document clusters per subject, the analysis of term co-occurrence, the generation of tag clouds containing highly represented terms per cluster and at-a-glance popup windows with information about relevant genes and proteins. Moreover, to support experimental research, BioTextQuest+ integrates its primary functionality with biological repositories and software tools able to deliver further bioinformatics services. The Google-like interface extends beyond simple use by offering a range of advanced parameterization options for expert users. We demonstrate the functionality of BioTextQuest+ through several exemplary research scenarios including author disambiguation, functional term enrichment, knowledge acquisition and concept discovery linking major human diseases, such as obesity and ageing.

AVAILABILITY The service is accessible at http://bioinformatics.med.uoc.gr/biotextquest.

CONTACT g.pavlopoulos@gmail.com or georgios.pavlopoulos@esat.kuleuven.be

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
