Formal models for string similarity

A common practical problem in large data-processing applications is the identification of records pertaining to a particular individual or entity. If the records are accessible through a key and the value of the key is known, then the retrieval problem is the standard one addressed by database researchers and practitioners. There are, however, many situations where an accurate key is not available. In such situations one must construct a model of the error process in order to identify which records can reasonably be considered related. Of particular interest is the modelling of errors in strings. The theory of rational relations provides a formal model for describing such errors in a natural way. Algorithms suggested by this theory are efficient and can therefore be used in practical applications. One particular subclass, the rational equivalence relations, is useful because its members partition their universes and provide a canonical form which can be used, for example, as a sort key. After the theory of rational relations is presented, the theory of rational equivalence relations is introduced and its formal properties are developed. The idea of defining Soundex-like codes as rational functions and using mechanical procedures to derive efficient algorithms is presented. Finally, mutation models are described in the form of rational relations.
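
To make the idea of a canonical form used as a key concrete, the following is a minimal sketch in Python. It assumes the classic American Soundex rules as the coding function; this is an illustration only, written by hand rather than derived mechanically from a rational function as the text proposes. The names soundex and group_by_code are hypothetical.

```python
from collections import defaultdict

# Assumed letter classes of classic American Soundex (not taken from the text).
_CODES = {}
for _letters, _digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Map a name to a four-character code that serves as its canonical form."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with equal codes
        digit = _CODES.get(c, "")  # vowels and Y have no code
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]  # pad with zeros, truncate to four characters


def group_by_code(names):
    """Partition names into classes; two names are related iff their codes match."""
    groups = defaultdict(list)
    for n in names:
        groups[soundex(n)].append(n)
    return dict(groups)
```

Because the code is a function of the string, "has the same code" is an equivalence relation: group_by_code partitions its input, and sorting records by soundex(name) places related records next to one another without comparing every pair.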