Large Scale Semantic Annotation, Indexing and Search at The National Archives

This paper describes a tool developed to improve access to the enormous volume of data housed at the UK's National Archives, both for the general public and for specialist researchers. The system we have developed, TNA-Search, enables multi-paradigm search over the entire electronic archive (42 TB of data in various formats). Queries can freely combine full-text, structural, linguistic and semantic constraints. The archive is annotated and indexed with respect to a massive semantic knowledge base containing data from the LOD cloud, data.gov.uk, related TNA projects, and a large geographical database. The semantic annotation component achieves approximately 83% F-measure, which is reasonable given the wide range of entities and document types and the open domain. The technologies are being adopted by real users at The National Archives and will form the core of its suite of search tools, alongside additional in-house interfaces.
