Knowledge graphs support multiple research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are typically populated through some form of manual curation, which is difficult to scale as the publication rate increases. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to automatically annotate textual data. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This makes populating a knowledge graph with multiple node and edge types practically infeasible. We sought to accelerate the label function creation process by evaluating the extent to which label functions could be re-used across multiple edge types. We used a subset of an existing knowledge graph centered on disease, compound, and gene entities to evaluate label function re-use. We determined the best label function combination by comparing a baseline database-only model with the same model augmented with either edge-specific or edge-mismatch label functions. We confirmed that adding edge-specific rather than edge-mismatch label functions often improves text annotation, and we showed that this approach can incorporate novel edges into our source knowledge graph. We expect that continued development of this strategy has the potential to swiftly populate knowledge graphs with new discoveries, ensuring that these resources include cutting-edge results.
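To make the idea of a label function concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes a toy candidate format (a dictionary holding a compound, a disease, and the sentence mentioning both), a hypothetical one-entry database, and hypothetical keyword lists. A simple majority vote stands in for the model that combines label function outputs; it is an illustrative substitute only.

```python
from collections import Counter

# Label values are an assumption of this sketch.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical toy "database" of known compound-treats-disease pairs.
KNOWN_TREATS = {("metformin", "type 2 diabetes")}


def lf_in_database(candidate):
    """Database-only signal: vote POSITIVE if the pair is already known."""
    pair = (candidate["compound"], candidate["disease"])
    return POSITIVE if pair in KNOWN_TREATS else ABSTAIN


def lf_treatment_keyword(candidate):
    """Edge-specific heuristic (hypothetical): treatment wording in the sentence."""
    text = candidate["sentence"].lower()
    return POSITIVE if any(k in text for k in ("treat", "therapy for")) else ABSTAIN


def lf_association_keyword(candidate):
    """Heuristic written for a different edge type (hypothetical), re-used here."""
    text = candidate["sentence"].lower()
    return POSITIVE if "associated with" in text else ABSTAIN


def majority_vote(candidate, label_functions):
    """Combine votes, ignoring abstentions; ties and empty votes abstain."""
    votes = [lf(candidate) for lf in label_functions]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


candidate = {
    "compound": "aspirin",
    "disease": "headache",
    "sentence": "Aspirin is commonly used to treat headache.",
}

# Baseline: the database alone has nothing to say about this pair.
baseline = majority_vote(candidate, [lf_in_database])
# Adding an edge-specific label function lets the sentence be annotated.
with_edge_specific = majority_vote(candidate, [lf_in_database, lf_treatment_keyword])
# Adding a label function from another edge type does not help in this case.
with_mismatch = majority_vote(candidate, [lf_in_database, lf_association_keyword])
```

In this toy case the database-only baseline abstains, the edge-specific function produces a positive annotation, and the re-used function from another edge type abstains. This mirrors the comparison described above only in shape; it says nothing about the actual results.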