Extracting structured data from publications in the art conservation domain

New discoveries about art conservation techniques and research are most commonly published as traditional full-text publications. Corpora of such publications typically support searching only via metadata (e.g. title, authors, or keywords) and full text. As a result, it is difficult to discover valuable information about the chemical processes, experimental results, or preservation treatments associated with the conservation of paintings from a specific genre. This article addresses this problem by focusing on the extraction of structured data, compliant with a pre-defined ontology, from a distributed corpus of publications about painting conservation. Our extraction method combines named entity recognition (using gazetteer-based and machine learning-based methods) with subsequent relationship extraction (using rule-based and machine learning-based methods). The resulting structured data are stored in a Resource Description Framework (RDF) triple store, and a Web-based graphical user interface supports SPARQL querying, retrieval, and display of the search results. Results from applying our techniques to a corpus of publications on art conservation indicate that our approach achieves higher precision and recall in extracting named entities and relations than existing alternative approaches.
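To make the pipeline concrete, the following is a minimal sketch of only the gazetteer-based entity recognition and rule-based relation extraction stages, producing subject-predicate-object triples of the kind that could be loaded into a triple store. It is not the implementation described in this article: the gazetteer terms, entity types, predicate names (`removedWith`, `boundIn`), and trigger patterns are all hypothetical, the machine learning-based components are omitted, and no ontology or RDF serialization is involved.

```python
import re
from typing import NamedTuple

# Hypothetical gazetteer: surface form -> entity type (illustrative only).
GAZETTEER = {
    "lead white": "Pigment",
    "linseed oil": "Binder",
    "varnish": "Coating",
    "acetone": "Solvent",
}

# Hypothetical rules: a trigger phrase found between two entities of the
# given types yields a triple with the given predicate.
RULES = [
    (re.compile(r"\bwas removed (?:with|using)\b"), "removedWith", "Coating", "Solvent"),
    (re.compile(r"\b(?:bound|mixed) (?:in|with)\b"), "boundIn", "Pigment", "Binder"),
]


class Entity(NamedTuple):
    start: int
    end: int
    text: str
    type: str


def find_entities(sentence):
    """Gazetteer lookup: return entities sorted by position in the sentence."""
    lowered = sentence.lower()
    found = []
    for term, etype in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            found.append(Entity(m.start(), m.end(), term, etype))
    return sorted(found)


def extract_triples(sentence):
    """Apply the rules to each pair of adjacent entities in the sentence."""
    entities = find_entities(sentence)
    lowered = sentence.lower()
    triples = []
    for left, right in zip(entities, entities[1:]):
        between = lowered[left.end:right.start]
        for pattern, predicate, subj_type, obj_type in RULES:
            if (left.type == subj_type and right.type == obj_type
                    and pattern.search(between)):
                triples.append((left.text, predicate, right.text))
    return triples


if __name__ == "__main__":
    for s in ["The aged varnish was removed with acetone.",
              "Lead white was bound in linseed oil."]:
        print(extract_triples(s))
```

In a fuller system, each triple would be mapped to ontology terms and stored as RDF so that it can be retrieved with SPARQL; that mapping is outside the scope of this sketch.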