Applying Citizen Science to Gene, Drug, Disease Relationship Extraction from Biomedical Abstracts

Biomedical literature is growing at a rate that outpaces our ability to harness the knowledge it contains. To mine valuable inferences from this large volume of literature, many researchers have turned to information extraction algorithms to harvest information from biomedical texts. Information extraction is usually accomplished through a combination of manual expert curation and computational methods. Advances in computational methods usually depend on gold standards generated by a limited number of expert curators. This process can be time-consuming and represents an area of biomedical research that is ripe for exploration with citizen science. Citizen scientists have previously been found to be willing and able to perform named entity recognition of disease mentions in biomedical abstracts, but it was uncertain whether the same could be said of relationship extraction. Relationship extraction requires training in identifying named entities as well as a deeper understanding of how different entity types can relate to one another. Here, we used the web-based application Mark2Cure (https://mark2cure.org) to demonstrate that citizen scientists can perform relationship extraction and to confirm the importance of accurate named entity recognition for this task. We also discuss opportunities for future improvement of this system, as well as the potential synergies between citizen science, manual biocuration, and natural language processing.
