Cleaning Data with Bayesian Methods

Data in many domains is subject to corrupting processes. Our goal in this paper is to clean such data, i.e., to reverse the effects of these corrupting processes. In part, we seek to produce data that supports the creation of better learners, but primarily we want to produce data faithful to the “untampered, original” set. Previous techniques have used learners to predict problems with class values in data sets, and these techniques have since been applied not only to detect but also to correct errors. However, they suffer from several problems: they can correct noise only in the class attribute, they do not fully leverage dependencies among attributes, and they are inappropriate for data sets with no distinguished class attribute. More recent techniques have looked at characterizing errors in non-class attributes. We use Bayesian techniques to take advantage of dependencies between attributes in a principled manner and to exploit expert knowledge of the relationships among the attributes. Our technique models the domain and the noise generation process, rates its confidence that each instance is noisy (or asserts that the instance is clean), and suggests incremental corrections to nodes that appear to be corrupted.
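
The idea of combining a domain model with a noise model can be illustrated with a small sketch. It is not the method of this paper, only a toy instance of Bayes' rule under stated assumptions: two binary attributes, a hand-specified joint distribution standing in for the domain model (DOMAIN), and independent per-attribute flips with a fixed probability standing in for the noise process (FLIP_PROB). All names and numbers are hypothetical.

```python
# Toy illustration only: every name and number here is an assumption,
# not taken from the paper.

# Assumed domain model: joint distribution over two binary attributes,
# where the attributes usually agree.
DOMAIN = {(0, 0): 0.50, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.40}

# Assumed noise model: each attribute is flipped independently.
FLIP_PROB = 0.1


def likelihood(observed, clean, flip_prob=FLIP_PROB):
    """P(observed | clean) under independent per-attribute flips."""
    p = 1.0
    for o, c in zip(observed, clean):
        p *= flip_prob if o != c else 1.0 - flip_prob
    return p


def posterior_over_clean(observed):
    """P(clean | observed) for every candidate clean instance, via Bayes' rule."""
    unnorm = {c: DOMAIN[c] * likelihood(observed, c) for c in DOMAIN}
    z = sum(unnorm.values())
    return {c: w / z for c, w in unnorm.items()}


def assess(observed):
    """Return (probability the instance is noisy, most probable clean instance)."""
    post = posterior_over_clean(observed)
    p_noisy = 1.0 - post[observed]    # mass on clean values that differ from the observation
    best = max(post, key=post.get)    # suggested correction (may equal the observation)
    return p_noisy, best


if __name__ == "__main__":
    for obs in [(0, 0), (0, 1)]:
        p_noisy, best = assess(obs)
        print(obs, "P(noisy) = %.3f" % p_noisy, "suggested clean value:", best)
```

In this sketch an observation that violates the assumed dependency between the attributes, such as (0, 1), receives a high probability of being noisy and a suggested correction, while an observation consistent with the domain model is left unchanged.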