Enabling information retrieval on historical document collections: the role of matching procedures and special lexica

Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. To improve the recall of search engines, modern words used in queries have to be associated with the corresponding historical variants found in the documents. In the literature, two ways to solve this problem have been suggested: (1) special matching procedures and (2) lexica for historical language. In the first part of the paper we show how the construction of matching procedures and lexica may benefit from each other, leading the way to a combination of both approaches. We present a tool in which matching rules and a historical lexicon are built in an interleaved way based on corpus analysis. A crucial question considered in the second part of the paper is whether matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over the centuries, an answer is not simple to obtain. We present experiments that study the performance of matching procedures on text collections from four centuries. After classifying the missed vocabulary, we measure precision and recall of the matching procedure for each period. Our results indicate that, for earlier periods, historical lexica represent an important corrective to matching procedures in IR applications.
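The interplay between matching rules and a lexicon can be illustrated with a minimal sketch. It is not the tool or the matching procedure described in the paper: the rewrite rules, the function names (`variants`, `search_terms`, `evaluate`), the corpus vocabulary, the lexicon entry and the gold set are all hypothetical toy data chosen for illustration. The sketch expands a modern query word into candidate historical spellings, optionally adds lexicon entries, and measures precision and recall against the gold set.

```python
# Hypothetical rewrite rules: modern substring -> possible historical spellings.
# These are illustrative assumptions, not rules taken from the paper.
RULES = {
    "t": ["th"],
    "ei": ["ey"],
    "u": ["v"],
}


def variants(word, rules=RULES):
    """Return all spellings obtained by optionally applying a rule at each position."""
    if not word:
        return {""}
    results = set()
    # Option 1: keep the first character unchanged.
    for tail in variants(word[1:], rules):
        results.add(word[0] + tail)
    # Option 2: apply every rule whose left-hand side is a prefix of the word.
    for lhs, replacements in rules.items():
        if word.startswith(lhs):
            for tail in variants(word[len(lhs):], rules):
                for rhs in replacements:
                    results.add(rhs + tail)
    return results


def search_terms(query, corpus_vocab, lexicon=None):
    """Historical tokens retrieved for a modern query word.

    Matching rules propose candidates; an optional lexicon contributes
    variants that the rules cannot derive.
    """
    found = variants(query) & corpus_vocab
    if lexicon is not None:
        found |= lexicon.get(query, set()) & corpus_vocab
    return found


def evaluate(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against a gold set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data, invented for this sketch.
corpus_vocab = {"theil", "teyl", "deil", "thal"}
lexicon = {"teil": {"deil"}}          # a variant the rules above cannot produce
gold = {"theil", "teyl", "deil"}      # assumed true variants of "teil"

rules_only = search_terms("teil", corpus_vocab)
combined = search_terms("teil", corpus_vocab, lexicon)

print(evaluate(rules_only, gold))
print(evaluate(combined, gold))
```

On this toy data the rules alone miss the variant that only the lexicon lists, so recall rises once the lexicon is consulted. The numbers carry no empirical weight; they only show the kind of measurement the abstract refers to.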
