Cultural Heritage Digital Resources: From Extraction to Querying

This article presents a method to extract and query Cultural Heritage (CH) textual digital resources. The extraction and querying phases are linked by a common ontological representation (CIDOC-CRM). A transport format (RDF) allows the ontology to be queried in a suitable query language (SPARQL), on top of which an interface makes it possible to formulate queries in Natural Language (NL). The extraction phase exploits the propositional nature of the ontology. The query interface is based on the Generate and Select principle, where potentially suitable queries are generated to match the user input, only for the most semantically similar candidate to be selected. In the process we evaluate data extracted from the description of a medieval city (Wolfenbuttel), transform and develop two methods of computing similarity between sentences based on WordNet. Experiments are described that compare the pros and cons of the similarity measures and evaluate them.

[1]  Matt Brown,et al.  Invited talk , 2007 .

[2]  Peter Thanisch,et al.  Natural language interfaces to databases – an introduction , 1995, Natural Language Engineering.

[3]  Chris Quirk,et al.  Monolingual Machine Translation for Paraphrase Generation , 2004, EMNLP.

[4]  Ted Pedersen,et al.  WordNet::Similarity - Measuring the Relatedness of Concepts , 2004, NAACL.

[5]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[6]  H. V. Jagadish,et al.  Constructing a Generic Natural Language Interface for an XML Database , 2006, EDBT.

[7]  Regina Barzilay,et al.  Paraphrasing for Automatic Evaluation , 2006, NAACL.

[8]  Kristian J. Hammond,et al.  Question Answering from Frequently Asked Question Files: Experiences with the FAQ FINDER System , 1997, AI Mag..

[9]  Amit P. Sheth Ontology Driven Information Systems in Action (Capturing and Applying Existing Knowledge to Semantic Applications , 2003 .

[10]  G. Burenhult XML Encoding of Archaeological Unstructured Data , 2002 .

[11]  Peter N. Yianilos,et al.  Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[12]  Graeme Hirst,et al.  Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures , 2004 .

[13]  Martin Doerr,et al.  The CIDOC CRM, an Ontological Approach to Schema Heterogeneity , 2005, Semantic Interoperability and Integration.

[14]  Ido Dagan,et al.  The Third PASCAL Recognizing Textual Entailment Challenge , 2007, ACL-PASCAL@ACL.