The Miguel de Cervantes Digital Library: the Hispanic Voice on the Web

This paper describes the philosophy behind one of the most ambitious projects of its kind undertaken in the Spanish-speaking world: the Miguel de Cervantes Digital Library (http://cervantesvirtual.com/). It explains the reasons behind its creation, the private-public sector alliance that has made it possible, and the new ground its creators are exploring, both in the services offered to a worldwide audience and in the innovative application of digital methods. The final section deals with the technical underpinnings of the project, present and future. It reports on continuing research and development at the Miguel de Cervantes Digital Library in the field of text markup and derived applications, such as the automatic transformation of documents to different formats and complex searches over the small textual objects defined by the markup scheme. A brief survey of work on Named Entity Recognition that can be applied to automatic markup is also included. Finally, there are some comments on the research lines we intend to follow concerning information retrieval and filtering from structurally marked-up texts.

This is a fascinating period in the history of libraries and publishing. For the first time, it is possible to build large-scale services where collections of information are stored in digital formats and retrieved over networks (Arms, 2000).
