Building knowledge graph from public data for predictive analysis: a case study on predicting technology future in space and time

A domain expert can process heterogeneous data to make meaningful interpretations or predictions. For example, by examining research papers and patent records, an expert can judge the maturity of an emerging technology and predict where (the geographic locations) and when (e.g., in which year) the technology will succeed. However, this task requires considerable expertise and manual effort. This paper presents an end-to-end system that uses an ontology to integrate heterogeneous data sources into a knowledge graph in the RDF (Resource Description Framework) format. The user can then query the knowledge graph to prepare the data required by different types of predictive analysis tools. To demonstrate the feasibility of the system, we present a case study that predicts the geographic center(s) of fuel cell technologies using data collected from public sources. The system extracts, cleanses, and augments data from public sources, including research papers and patent records. It then applies an ontology-based data integration method to generate knowledge graphs in the RDF format, which lets users switch quickly between machine learning models for predictive analytics tasks. We tested the system using the Support Vector Machine and Multiple Hidden Markov Models and achieved 66.7% and 83.3% accuracy on the city and year levels of spatial and temporal resolution, respectively.
