Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content

In this paper we show how we used robust human language technology, such as our domain-independent and customisable named entity recogniser, for automatic content annotation and indexing in two digital library applications. Each application posed its own challenge: one required adapting the language processing components to the non-standard written conventions of 18th-century English, while the other required processing material in multiple modalities. This reusable technology could also form the basis for computational tools for the study of cultural heritage languages, such as Ancient Greek and Latin.
