Defining diff as a Data Mining Primitive

The emphasis on discovery in the knowledge discovery process, while important in its own right, has distracted from the equally important process of knowledge representation and maintenance. For a system to indicate what is new or different, it must have an understanding of what is old, well understood, or expected. In this paper, we propose diff as a fundamental data mining primitive. We show how it can be used to capture knowledge, either as a set of representative instances or as a set of rules, in a framework that is tightly integrated with the knowledge discovery process. We show how it can be applied to discrete and continuous attributes as well as to association rules. Lastly, we show how it enables the user to pinpoint high-level differences between two data sets that share the same attributes.