Natural Language Processing for Cultural Heritage Domains

Museums, archives, libraries and other cultural heritage institutions maintain large collections of artefacts which are valuable knowledge sources for both experts and interested lay persons. In recent years, a growing number of these institutions have started to digitise their collections, for instance to make them accessible via web portals. Digitisation is a necessary first step towards improved information access, but to fully unlock the knowledge contained in these collections, users must be able to browse, search and query them easily. This requires cleaning, linking and enriching the data, a process that is often too time-consuming to be performed manually. Information technology can help to automate this task, at least partially. Since data processing and enrichment typically operate on textual metadata, natural language processing has a key role to play in this endeavour. At the same time, cultural heritage domains pose significant challenges for language technology and call for the development of very robust and flexible solutions. Consequently, cultural heritage data can also serve as a good test-bed for the development of robust natural language processing tools.
