The Lorenz Dominance Order as a Measure of Interestingness in KDD

Ranking summaries generated from databases is useful in descriptive data mining tasks, where a single data set can be generalized in many different ways and at many levels of granularity. Our approach to generating summaries is based upon a data structure, associated with an attribute, called a domain generalization graph (DGG). A DGG for an attribute is a directed graph in which each node represents a domain of values created by partitioning the original domain for the attribute, and each edge represents a generalization relation between these domains. Given a set of DGGs associated with a set of attributes, a generalization space can be defined as the set of all possible combinations of domains, where each combination selects one domain from each DGG. This generalization space therefore describes all possible summaries, consistent with the DGGs, that can be generated from the selected attributes. When the number of attributes to be generalized is large, or the DGGs associated with the attributes are complex, the generalization space can be very large, resulting in the generation of many summaries. The number of summaries can easily exceed a domain expert's capacity to identify interesting results. In this paper, we show that the Lorenz dominance order can be used to rank the summaries prior to presentation to the domain expert. In most cases the Lorenz dominance order defines a partial order on the summaries; in some cases it defines a total order. The rank order of the summaries represents an objective evaluation of their relative interestingness and provides the domain expert with a starting point for further subjective evaluation of the summaries.
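
The two ideas above, the generalization space and the Lorenz dominance comparison, can be illustrated with a minimal sketch. It rests on assumptions that are not stated in this abstract: each DGG is reduced to a plain list of node names (edges are ignored), each summary is characterized by a vector of tuple counts, and the Lorenz curve is the standard piecewise-linear cumulative-share curve over counts sorted in ascending order. The function names (`generalization_space`, `lorenz_points`, `lorenz_value`, `compare`) and the attribute and node names are hypothetical. The sketch only reports which curve lies on or above the other; it makes no claim about which direction corresponds to greater interestingness.

```python
from fractions import Fraction
from itertools import product


def generalization_space(dggs):
    """Enumerate every combination that picks one domain (node) per DGG.

    `dggs` maps an attribute name to a list of its DGG node names
    (an assumed, simplified representation that ignores edges).
    """
    attrs = sorted(dggs)
    return [dict(zip(attrs, combo))
            for combo in product(*(dggs[a] for a in attrs))]


def lorenz_points(counts):
    """Breakpoints (p, L(p)) of the Lorenz curve for a vector of positive counts."""
    xs = sorted(counts)
    total, n = sum(xs), len(xs)
    points = [(Fraction(0), Fraction(0))]
    running = 0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((Fraction(i, n), Fraction(running, total)))
    return points


def lorenz_value(points, p):
    """Evaluate the piecewise-linear curve at p by linear interpolation."""
    for (p0, l0), (p1, l1) in zip(points, points[1:]):
        if p0 <= p <= p1:
            return l0 + (l1 - l0) * (p - p0) / (p1 - p0)
    raise ValueError("p must lie in [0, 1]")


def compare(a, b):
    """Compare two count vectors by their Lorenz curves.

    Returns "first" if a's curve lies on or above b's everywhere,
    "second" for the reverse, "equal" if the curves coincide, and
    None if the curves cross (the pair is incomparable).
    """
    pa, pb = lorenz_points(a), lorenz_points(b)
    # Both curves are piecewise linear, so checking the union of their
    # breakpoints is enough to decide dominance on the whole interval.
    grid = sorted({p for p, _ in pa} | {p for p, _ in pb})
    diffs = [lorenz_value(pa, p) - lorenz_value(pb, p) for p in grid]
    if all(d == 0 for d in diffs):
        return "equal"
    if all(d >= 0 for d in diffs):
        return "first"
    if all(d <= 0 for d in diffs):
        return "second"
    return None
```

For example, with two hypothetical DGGs of three nodes each, `generalization_space` yields nine candidate summaries. Under the assumptions above, `compare([5, 5, 5, 5], [1, 1, 1, 17])` returns `"first"`, while `compare([2, 2, 6], [1, 4, 5])` returns `None` because the two curves cross. Incomparable pairs of this kind are why the resulting order is in general only partial.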