Generating Search Term Variants for Text Collections with Historic Spellings

In this paper, we describe a new approach to retrieval in texts with non-standard spelling, which is important for historic texts in English or German. For this purpose, we present a new algorithm for generating search term variants in historic orthography. By applying a spell checker to a corpus of historic texts, we generate a list of candidate terms, to which the contemporary spellings have to be assigned manually. From these pairs of historic and contemporary spellings, our algorithm produces a set of probabilistic rules for generating variants. The rule probabilities can be used for ranking in the retrieval stage. An experimental comparison shows that our approach outperforms competing methods.
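
The pipeline outlined above can be illustrated with a minimal sketch. It is not the algorithm of the paper, whose rule induction is not specified in this abstract. As a stand-in, it assumes that the manually assigned pairs (contemporary spelling, historic spelling) are already available, derives substitution rules from a character alignment computed with Python's `difflib`, and estimates each rule's probability as the fraction of occurrences of its left-hand side that were rewritten. The function names `induce_rules` and `generate_variants` and the example word pairs are hypothetical; insertions and deletions (such as "teil" to "theil") are ignored for brevity.

```python
from collections import Counter
from difflib import SequenceMatcher


def induce_rules(pairs):
    """Derive substitution rules (modern -> historic) with estimated probabilities.

    Assumption: probability = (times the rule was observed) /
    (occurrences of its left-hand side in all modern spellings).
    Only 'replace' alignments are used; insertions and deletions are skipped.
    """
    rule_counts = Counter()
    for modern, historic in pairs:
        matcher = SequenceMatcher(None, modern, historic, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                rule_counts[(modern[i1:i2], historic[j1:j2])] += 1

    rules = {}
    for (lhs, rhs), count in rule_counts.items():
        # Count every (non-overlapping) occurrence of the left-hand side.
        occurrences = sum(modern.count(lhs) for modern, _ in pairs)
        rules[(lhs, rhs)] = count / occurrences
    return rules


def generate_variants(term, rules):
    """Apply each rule once at each matching position; rank by rule probability."""
    variants = {}
    for (lhs, rhs), prob in rules.items():
        start = term.find(lhs)
        while start != -1:
            variant = term[:start] + rhs + term[start + len(lhs):]
            # Keep the highest probability if several rules yield the same variant.
            variants[variant] = max(variants.get(variant, 0.0), prob)
            start = term.find(lhs, start + 1)
    # Highest probability first; ties broken alphabetically.
    return sorted(variants.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical training pairs: (contemporary spelling, historic spelling).
pairs = [("sein", "seyn"), ("bei", "bey"), ("frei", "frey"), ("teil", "theil")]
rules = induce_rules(pairs)
ranked = generate_variants("drei", rules)
print(ranked)
```

In this toy setting the only induced rule rewrites "i" as "y", observed in three of the four occurrences of "i", so the query term "drei" is expanded to the variant "drey" with an estimated probability of 0.75. A retrieval engine could use such a value to weight matches on the variant lower than matches on the original term.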
