While today's orthography is strict and seldom changes, this has not always been the case. In historical texts, the spelling of a word often differs from today's, and in some periods it even varies from one occurrence to the next within a single text. Information retrieval on historical corpora can cope with these variations through fuzzy matching based on Levenshtein distance with stochastic weights, in particular by using the noisy channel model of (3) and the simple algorithm given there. That algorithm, originally used for spell checking, is adapted here to the retrieval of historical word forms from queries in modern spelling; it uses stochastic weights learned from training pairs of modern and historical spellings. Using these weights improves the F-score over standard Levenshtein distance. Preparing the training pairs usually requires manual work. To avoid this work, we devised an unsupervised algorithm for obtaining the training pairs.
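The idea of a Levenshtein distance with stochastic weights can be illustrated with a minimal sketch. This is not the algorithm of the cited work: the function names, the use of negative log probabilities as edit costs, and the example probabilities are all assumptions made for illustration, and only single-character edits are considered.

```python
import math


def costs_from_probabilities(edit_probs):
    """Turn edit probabilities into additive costs (assumption: cost = -log p)."""
    return {edit: -math.log(p) for edit, p in edit_probs.items()}


def weighted_levenshtein(source, target, edit_cost=None, default_cost=1.0):
    """Edit distance in which individual character edits may carry learned costs.

    edit_cost maps (a, b) to a cost. The empty string marks a missing
    character, so ("", "h") is an insertion of "h" and ("h", "") a deletion.
    Edits without an entry fall back to default_cost, which gives the
    standard Levenshtein distance when edit_cost is empty.
    """
    edit_cost = edit_cost or {}

    def cost(a, b):
        if a == b:
            return 0.0
        return edit_cost.get((a, b), default_cost)

    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # First column: delete every character of the source prefix.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + cost(source[i - 1], "")
    # First row: insert every character of the target prefix.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + cost("", target[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + cost(source[i - 1], ""),      # deletion
                dist[i][j - 1] + cost("", target[j - 1]),      # insertion
                dist[i - 1][j - 1] + cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]


# Hypothetical probabilities; real values would be estimated from
# training pairs of modern and historical spellings.
example_costs = costs_from_probabilities({("", "h"): 0.5, ("u", "v"): 0.8})

# Modern query "teil" against the historical spelling "theil".
plain = weighted_levenshtein("teil", "theil")
weighted = weighted_levenshtein("teil", "theil", example_costs)
```

With these illustrative weights, an edit that is frequent in the training pairs (here the inserted "h") costs less than an arbitrary edit, so the historical variant ranks closer to the modern query than an unrelated word at the same unweighted distance.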
[1] Eric Brill, et al. An Improved Error Model for Noisy Channel Spelling Correction. ACL, 2000.
[2] Klaus U. Schulz, et al. Information Access to Historical Documents from the Early New High German Period. Digital Historical Corpora, 2006.
[3] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals. 1965.
[4] Fred J. Damerau, et al. A technique for computer detection and correction of spelling errors. CACM, 1964.
[5] Lalit R. Bahl, et al. Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition. IEEE Trans. Inf. Theory, 1975.
[6] Geoffrey d. Chaucer, et al. The Works Of Geoffrey Chaucer. 1957.