Interpretable and Informative Explanations of Outcomes

In this paper, we solve the following data summarization problem: given a multi-dimensional data set augmented with a binary attribute, how can we construct an interpretable and informative summary of the factors affecting the binary attribute in terms of the combinations of values of the dimension attributes? We refer to such summaries as explanation tables. We show the hardness of constructing optimally-informative explanation tables from data, and we propose effective and efficient heuristics. The proposed heuristics are based on sampling and include optimizations related to computing the information content of a summary from a sample of the data. Using real data sets, we demonstrate the advantages of explanation tables compared to related approaches that can be adapted to solve our problem, and we show significant performance benefits of our optimizations.

[1]  Divesh Srivastava,et al.  Efficient and Effective Analysis of Data Quality using Pattern Tableaux , 2011, IEEE Data Eng. Bull..

[2]  Samuel Madden,et al.  Scorpion: Explaining Away Outliers in Aggregate Queries , 2013, Proc. VLDB Endow..

[3]  Jilles Vreeken,et al.  Tell me what i need to know: succinctly summarizing data with itemsets , 2011, KDD.

[4]  Tijl De Bie,et al.  An Information-Theoretic Approach to Finding Informative Noisy Tiles in Binary Databases , 2010, SDM.

[5]  Jiawei Han,et al.  Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration , 2003, Very Large Data Bases Conference.

[6]  RamakrishnanRaghu,et al.  Bottom-up computation of sparse and Iceberg CUBE , 1999 .

[7]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[8]  Bei Yu,et al.  On generating near-optimal tableaux for conditional functional dependencies , 2008, Proc. VLDB Endow..

[9]  Laks V. S. Lakshmanan,et al.  MDL Summarization with Holes , 2005, VLDB.

[10]  Yang Xiang,et al.  Succinct summarization of transactional databases: an overlapped hyperrectangle scheme , 2008, KDD.

[11]  Sunita Sarawagi User-cognizant multidimensional analysis , 2001, The VLDB Journal.

[12]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[13]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[14]  Raghu Ramakrishnan,et al.  Bottom-up computation of sparse and Iceberg CUBE , 1999, SIGMOD '99.

[15]  Laks V. S. Lakshmanan,et al.  The Generalized MDL Approach for Summarization , 2002, VLDB.

[16]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[17]  Ian Witten,et al.  Data Mining , 2000 .