PUB: A Class Description Technique Based on Partial Coverage of Subspace

A good description of a class should be both accurate and interpretable. Previous work describes classes either by analyzing the correlation of each attribute with the class, or by producing rules as when building a classifier. Both approaches fall short in accuracy and interpretability. A description naturally consists of sentences, and each sentence consists of a set of terms. Normally, a sentence is defined as a disjunction or conjunction of several terms, each of which specifies a constraint (a range or set of values) on an attribute. From the data analysis point of view, a sentence specifies a subspace of the database. In this paper, we propose a richer yet still interpretable form of sentence: a sentence describes an object if any $k$ attributes of that object satisfy the specified constraints. To that end, we design \textsc{Pub}, an algorithm that produces descriptions built from this form of sentence. While constructing a sentence (within the description), \textsc{Pub} finds the optimal range or set of values for each attribute in linear time. We also show empirically that \textsc{Pub} is efficient and produces more accurate, concise, and interpretable descriptions than current approaches on various real datasets.
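
To make the proposed sentence form concrete, the following minimal sketch checks whether a sentence covers an object. It is an illustration only, not the \textsc{Pub} algorithm: it assumes that "any $k$ attributes" means "at least $k$ of the constrained attributes", and the names `sentence_covers`, `sentence`, `alice`, and `bob`, as well as the attribute values, are hypothetical.

```python
from typing import Any, Callable, Dict

# A constraint is a predicate over a single attribute value (assumed representation).
Constraint = Callable[[Any], bool]


def sentence_covers(obj: Dict[str, Any],
                    constraints: Dict[str, Constraint],
                    k: int) -> bool:
    """Return True if at least k of the sentence's constraints hold for obj."""
    satisfied = sum(
        1
        for attr, holds in constraints.items()
        if attr in obj and holds(obj[attr])
    )
    return satisfied >= k


# Hypothetical sentence with one range or set constraint per attribute.
sentence: Dict[str, Constraint] = {
    "age": lambda v: 30 <= v <= 50,             # range constraint
    "income": lambda v: v >= 60000,             # range constraint
    "city": lambda v: v in {"Oslo", "Bergen"},  # set constraint
}

alice = {"age": 42, "income": 45000, "city": "Oslo"}  # satisfies age and city
bob = {"age": 25, "income": 50000, "city": "Oslo"}    # satisfies city only

print(sentence_covers(alice, sentence, k=2))  # True
print(sentence_covers(bob, sentence, k=2))    # False
```

Under this reading, setting $k$ to 1 recovers a disjunction of the terms and setting $k$ to the number of terms recovers a conjunction, so the partial-coverage form sits between the two conventional sentence types.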
