KnetMiner: a comprehensive approach for supporting evidence‐based gene discovery and complex trait analysis across species

Generating new ideas and scientific hypotheses is often the result of extensive literature and database reviews, overlaid with scientists' own novel data and a creative process of making connections that were not made before. We have developed a comprehensive approach to guide this technically challenging data integration task and to make knowledge discovery and hypothesis generation easier for plant and crop researchers. KnetMiner can digest large volumes of scientific literature and biological research to find and visualise links between the genetic and biological properties of complex traits and diseases. Here we report the main design principles behind KnetMiner and provide use cases for mining public datasets to identify unknown links between traits such as grain colour and pre-harvest sprouting in Triticum aestivum, and for an evidence-based approach to identifying candidate genes under an Arabidopsis thaliana petal size QTL. We have developed KnetMiner knowledge graphs and applications for a range of species including plants, crops and pathogens. KnetMiner is the first open-source gene discovery platform that can leverage genome-scale knowledge graphs, generate evidence-based biological networks and be deployed for any species with a sequenced genome. KnetMiner is available at http://knetminer.org.
