New discoveries in art conservation techniques and research have most commonly been published as traditional full-text publications. Corpora of such publications typically support searching only by metadata (e.g., title, authors, or keywords) and by full text. As a result, it is difficult to discover valuable information about the chemical processes, experimental results, or preservation treatments associated with the conservation of paintings from a specific genre. This article addresses the problem by extracting structured data, compliant with a pre-defined ontology, from a distributed corpus of publications about painting conservation. Our extraction method combines named entity recognition (using gazetteer-based and machine learning-based methods) with subsequent relationship extraction (using rule-based and machine learning-based methods). The resulting structured data are stored in a Resource Description Framework (RDF) triple store, and a Web-based graphical user interface supports SPARQL querying, retrieval, and display of the search results. Results from applying our techniques to a corpus of publications on art conservation indicate that our approach achieves higher precision and recall in extracting named entities and relations than existing alternative approaches.
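The two-stage pipeline described above can be illustrated with a minimal sketch covering only the gazetteer-based and rule-based halves; the machine learning components are omitted. Everything in it is an assumption made for illustration and is not taken from the article: the gazetteer entries, the entity types, the trigger patterns, and the `ex:` predicate names are all hypothetical, as are the function names `find_entities` and `extract_triples`.

```python
import re
from typing import NamedTuple

# Hypothetical gazetteer: surface form -> entity type (illustrative only).
GAZETTEER = {
    "linseed oil": "Material",
    "varnish": "Material",
    "acetone": "Solvent",
    "yellowing": "Degradation",
}

# Hypothetical rules: (subject type, trigger pattern, object type, predicate).
RULES = [
    ("Solvent", r"\bremoves?\b", "Material", "ex:removes"),
    ("Material", r"\bcauses?\b", "Degradation", "ex:causes"),
]


class Entity(NamedTuple):
    text: str
    label: str
    start: int
    end: int


def find_entities(sentence: str) -> list[Entity]:
    """Gazetteer-based NER: match each known term case-insensitively."""
    entities = []
    for term, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            entities.append(Entity(m.group(0).lower(), label, m.start(), m.end()))
    return sorted(entities, key=lambda e: e.start)


def extract_triples(sentence: str) -> list[tuple[str, str, str]]:
    """Rule-based relation extraction over ordered entity pairs.

    A rule fires when the entity types match and the trigger pattern
    occurs in the text between the subject and the object.
    """
    entities = find_entities(sentence)
    triples = []
    for i, subj in enumerate(entities):
        for obj in entities[i + 1:]:
            between = sentence[subj.end:obj.start]
            for s_type, trigger, o_type, predicate in RULES:
                if (subj.label == s_type and obj.label == o_type
                        and re.search(trigger, between, flags=re.IGNORECASE)):
                    triples.append((subj.text, predicate, obj.text))
    return triples


print(extract_triples("Acetone removes aged varnish from the surface."))
# prints [('acetone', 'ex:removes', 'varnish')]
```

Each resulting (subject, predicate, object) tuple has the shape of an RDF triple, so tuples like these could be mapped to ontology terms and loaded into a triple store for SPARQL querying. A purely rule-based extractor of this kind tends to miss relations phrased in ways the rules do not anticipate, which is one reason to combine it with machine learning-based methods.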