MANENT: An Infrastructure for Integrating, Structuring and Searching Digital Libraries

Digital Libraries represent the commitment of research communities to preserve authoritative, well-structured sources of knowledge and to share archival organisations, methods and resources through systems that rely on standard metadata formats. This chapter describes natural language processing techniques for automatically extracting structural information from documents stored in Digital Libraries, based on the metadata they expose, and surveys and discusses the most prominent results achieved in this area. As an example of an infrastructure for integrating, structuring and searching Digital Libraries using natural language processing and semantic web techniques, we discuss the MANENT system. MANENT is a working prototype offering Digital Library content management, record classification and record retrieval services. It is hosted on a server at the Computer Science Department of Genova University and will become publicly available starting in 2011. So far, 475,000 records have been downloaded and stored from 138 repositories around the world that expose OAI-PMH services, and their automatic classification is under way.
