A pilot investigation of information extraction in the semantic annotation of archaeological reports

The paper discusses a prototype investigation of semantic annotation, a form of metadata that assigns conceptual entities to textual instances, applied here to archaeological grey literature. Information Extraction (IE), a Natural Language Processing (NLP) technique, is central to the annotation process, while Knowledge Organization Systems (KOS) are explored as a means of associating semantic annotations with both ontological and terminological references. The annotation process follows a rule-based information extraction approach using the GATE NLP toolkit, together with the CIDOC CRM ontology, its CRM-EH archaeological extension, and English Heritage thesauri and glossaries. Results from an initial evaluation suggest that these information extraction techniques can be applied to archaeological grey literature reports. Further work is discussed, drawing on the evaluation and on the characteristics of the archaeology domain.
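To make the idea of terminology-driven, rule-based annotation concrete, the following is a minimal Python sketch and not the GATE/JAPE pipeline used in the paper. It assumes a tiny hand-made gazetteer in place of the English Heritage thesauri, and the concept labels (`Context`, `Find`, `TimeAppellation`) are illustrative placeholders rather than actual CIDOC CRM or CRM-EH identifiers. The names `GAZETTEER`, `Annotation`, and `annotate` are hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-gazetteer: lower-cased term -> illustrative concept label.
# A real system would load terms from thesauri and map them to ontology classes.
GAZETTEER = {
    "pit": "Context",
    "ditch": "Context",
    "pottery": "Find",
    "coin": "Find",
    "roman": "TimeAppellation",
    "medieval": "TimeAppellation",
}


@dataclass(frozen=True)
class Annotation:
    start: int    # character offset where the matched term begins
    end: int      # character offset just past the matched term
    text: str     # the matched surface string
    concept: str  # concept label assigned from the gazetteer


def annotate(text: str) -> list[Annotation]:
    """Return one annotation per whole-word gazetteer match, in text order."""
    # Longest terms first so longer entries win over shorter ones.
    terms = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        Annotation(m.start(), m.end(), m.group(0), GAZETTEER[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for ann in annotate("A Roman pit containing pottery"):
        print(ann)
```

Each annotation keeps its character offsets, so the conceptual label stays tied to a specific textual instance, which is the sense of semantic annotation described above. The sketch covers only exact term lookup; contextual rules that combine or disambiguate matches are outside its scope.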
