A Cross-Language Approach to Historic Document Retrieval

Our cultural heritage, as preserved in libraries, archives and museums, includes documents written many centuries ago. Large-scale digitization initiatives, like DigiCULT, make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience. Natural languages evolve over time: pronunciation and spelling change, new words are introduced continuously, and older words may disappear from everyday use. For these reasons, queries involving modern words may not be very effective for retrieving documents that contain many historic terms. Although reading a 300-year-old document might not be problematic because the words are still recognizable, the changes in vocabulary and spelling can make it difficult to use a search engine to find relevant documents. To illustrate this, consider the following example from our collection of 17th century Dutch law texts. When looking for information on the tasks of a lawyer (modern Dutch: "advocaat") in these texts, the modern spelling will not lead you to documents containing the 17th century Dutch spelling variant "advocaet". Since spelling rules were not introduced until the 19th century, 17th century Dutch spelling is inconsistent. Because spelling was based mainly on pronunciation, words were often written in several different variants, which poses a problem for standard retrieval engines. We therefore define Historic Document Retrieval (HDR) as the retrieval of relevant historic documents for a modern query. Our approach to this problem is to treat the historic and modern languages as different languages, and use cross-language information retrieval (CLIR) techniques to translate one language into the other.
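
The idea of translating a modern query into historic spelling can be illustrated with a minimal sketch. It is not the method of the paper: the rewrite rule, the function names and the toy documents are all assumptions, and the single rule "aa" to "ae" is motivated only by the advocaat/advocaet example above.

```python
# Hypothetical modern-to-historic rewrite rules. Only ("aa", "ae") is
# suggested by the advocaat/advocaet example; it is an illustrative
# assumption, not a rule taken from the paper.
REWRITE_RULES = [("aa", "ae")]


def expand_query_term(term, rules=REWRITE_RULES):
    """Return the modern term plus any variants produced by the rules."""
    variants = {term}
    for modern, historic in rules:
        if modern in term:
            variants.add(term.replace(modern, historic))
    return sorted(variants)


def search(query_terms, documents, rules=REWRITE_RULES):
    """Return indices of documents that contain any expanded query term."""
    expanded = {
        variant
        for term in query_terms
        for variant in expand_query_term(term, rules)
    }
    return [
        index
        for index, document in enumerate(documents)
        if expanded & set(document.lower().split())
    ]


# Toy documents, invented for illustration only.
documents = ["de advocaet spreekt", "het huis"]
hits_with_rules = search(["advocaat"], documents)
hits_without_rules = search(["advocaat"], documents, rules=[])
```

With the rule in place, the modern query term also matches the document that uses the historic variant; with an empty rule list, the same query finds nothing. A realistic system would need many more rules, or would derive them from data, and would have to handle variants that no single substitution covers.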
