DISNET: a framework for extracting phenotypic disease information from public sources

Background: Within the global endeavour of improving population health, one major challenge is identifying and integrating medical knowledge that is spread across several information sources. Building a comprehensive dataset of diseases and their clinical manifestations from public sources is a promising approach: it makes it possible not only to merge and complement existing medical knowledge, but also to extend it, to interconnect existing data, and to analyse how diseases relate to each other. In this paper, we present DISNET (http://disnet.ctb.upm.es/), a web-based system designed to periodically extract knowledge on signs and symptoms from medical databases and to enable the creation of customisable disease networks.

Methods: We present the main features of the DISNET system and describe how information on diseases and their phenotypic manifestations is extracted from Wikipedia and PubMed. Specifically, texts from these sources are processed through a combination of text mining and natural language processing techniques.

Results: We validate the system on Wikipedia and PubMed texts and report the accuracy obtained. The final output is a comprehensive symptoms-disease dataset, freely accessible through the system's API. Finally, we describe, through some simple use cases, how a user can interact with the system and extract information for subsequent analyses.

Discussion: DISNET allows retrieving knowledge about the signs, symptoms and diagnostic tests associated with a disease. It is not limited to a specific category of disease (it covers all the categories offered by the selected sources of information) or of clinical diagnosis term. It also allows tracking the evolution of those terms over time, thus offering an opportunity to analyse and observe the progress of human knowledge on diseases. We further discuss the validation of the system, which suggests that it is good enough to be used to extract diseases and diagnostically relevant terms. At the same time, the evaluation revealed that improvements could be introduced to enhance the system's reliability.
