Hunting of the Snark: Finding Data Glitches using Data Mining Methods

Data quality is critical to data analysis because bad data can lead to incorrect conclusions. Problems with data are best detected early, before too much time and effort are spent ingesting and analyzing it. In this paper, we propose the use of data mining techniques for the automatic detection of data problems commonly encountered in large multivariate data sets. Data mining methods are ideal for this purpose, since they are designed for finding abnormal patterns in large volumes of data. We discuss some important types of data integrity issues. We demonstrate the use of a data mining method, the DataSphere set comparison technique (from our earlier work [6]), to detect glitches that mimic the error conditions discussed, using artificial data.