This paper presents a fast approximate string matching method. Constructing information spaces such as digital libraries requires collecting vast amounts of information and converting it into uniformly organized data. Since much of this information must be converted automatically from various media, the resulting space contains garbled text of varying accuracy. To make use of such text, a matching system must satisfy three requirements: high recall, high precision, and fast matching. To meet these requirements, we have been developing a two-phase matching system. The presented method is used in the first phase for fast, high-recall selection of candidate words. The key idea is to represent a word by a portion of its characters together with a distance pattern, so that existing index techniques can be applied. Experiments confirm that the presented method achieves high recall even for poorly recognized text.
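The abstract does not spell out how the keys are built, so the following is only a minimal sketch of one possible reading of "a portion of characters of a word and a distance pattern". It assumes that each key is a pair of characters plus the gap between their positions, and that an ordinary inverted index (here a Python dictionary) stands in for the "current index techniques". The function names `keys`, `build_index`, and `candidates`, and the parameters `max_gap` and `min_shared`, are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def keys(word, max_gap=2):
    """Return the set of (char, char, gap) keys for a word.

    Assumption: a key is two characters of the word plus the distance
    between their positions. This is an illustrative choice, not the
    paper's definition.
    """
    out = set()
    for i, first in enumerate(word):
        for gap in range(1, max_gap + 1):
            if i + gap < len(word):
                out.add((first, word[i + gap], gap))
    return out


def build_index(vocabulary, max_gap=2):
    """Build an inverted index mapping each key to the words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for key in keys(word, max_gap):
            index[key].add(word)
    return index


def candidates(query, index, min_shared=2, max_gap=2):
    """First-phase selection: words sharing at least min_shared keys with the query."""
    counts = defaultdict(int)
    for key in keys(query, max_gap):
        for word in index.get(key, ()):
            counts[word] += 1
    return {word for word, count in counts.items() if count >= min_shared}


vocabulary = ["library", "digital", "matching", "recall"]
index = build_index(vocabulary)

# "1ibrary" imitates a recognition error in which "l" was read as "1".
print(candidates("1ibrary", index))
```

A single misrecognized character destroys only the keys that involve it, so the garbled query still shares most of its keys with the intended word. In a two-phase system, a more precise and more expensive comparison would then filter these candidates in the second phase.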