Towards hybrid human-machine scientific information extraction

A wealth of valuable research data is locked within the millions of research articles published every year [1]. Extracting pertinent scientific facts (e.g., materials properties, known variants in genomics, population statistics) from those articles has become an unmanageable task for researchers. This problem hinders the advancement of science by making it difficult to build on existing results, avoid unnecessary repetition, and translate results into applications. Moreover, because these data are often loosely encoded in esoteric scientific articles intended for human consumption, they are generally not machine-accessible. As a result, it is often intractable to develop studies that automatically leverage this valuable information.

[1] Carol Friedman, et al. Research Paper: A General Natural-language Text Processor for Clinical Radiology, 1994, J. Am. Medical Informatics Assoc.

[2] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[3] Roselyne B. Tchoua, et al. Blending Education and Polymer Science: Semi Automated Creation of a Thermodynamic Property Database, 2016, Journal of chemical education.

[4] Ian T. Foster, et al. A Hybrid Human-computer Approach to the Extraction of Scientific Facts from the Literature, 2016, ICCS.

[5] Michael Krauthammer, et al. Term identification in the biomedical literature, 2004, J. Biomed. Informatics.

[6] Callum Court, et al. ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature, 2017.

[7] Sunghwan Sohn, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications, 2010, J. Am. Medical Informatics Assoc.

[8] Nitash P. Balsara, et al. Thermodynamics of Polymer Blends, 2007.

[9] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[10] Hongfang Liu, et al. Representing information in patient reports using natural language processing and the extensible markup language, 1999, Journal of the American Medical Informatics Association: JAMIA.

[11] Mark Ware, et al. The STM report: An overview of scientific and scholarly journal publishing fourth edition, 2015.

[12] Michael Krauthammer, et al. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles, 2001, ISMB.

[13] Ian T. Foster, et al. Towards a Hybrid Human-Computer Scientific Information Extraction Pipeline, 2017, 2017 IEEE 13th International Conference on e-Science (e-Science).