Finding Dependency Trees from Binary Data

Finding interesting subsets of items has received much attention because of its broad applications in financial data analysis, e-commerce, text mining, and other areas. Although frequent pattern mining has attracted the most attention in the research community, recent work has turned to more sophisticated relationships among items. Chow-Liu trees and low-entropy trees, for example, have been used to summarize frequent patterns. In this paper, we consider the problem of finding a novel dependency tree from binary data. Our approach has several advantages over previous related work. First, we propose a novel distance measure between items based on information theory, which captures both the expected uncertainty in an item pair and the mutual information between the items. Second, based on this distance measure, we present a simple yet efficient algorithm for finding dependency trees from binary data. We also show how the approach can be applied to frequent pattern summarization. A running example on a synthetic dataset shows that our approach achieves good results compared with existing popular heuristics.
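To make the idea of an information-theoretic distance and a tree built from it concrete, the following is a minimal sketch, not the method of the paper. The abstract does not state the distance formula, so the sketch assumes one plausible choice that combines the joint uncertainty of a pair with its mutual information: d(X, Y) = H(X, Y) - I(X; Y), the variation of information. The tree is then grown with Prim's minimum spanning tree algorithm, which is also an assumption. The names `entropy`, `pair_distance`, and `dependency_tree` are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (in bits) of an empirical distribution given raw counts."""
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def pair_distance(rows, i, j):
    """Assumed distance between items i and j: H(X,Y) - I(X;Y).

    This is the variation of information, chosen here only for illustration;
    the measure proposed in the paper may differ.
    """
    n = len(rows)
    h_i = entropy(Counter(r[i] for r in rows).values(), n)
    h_j = entropy(Counter(r[j] for r in rows).values(), n)
    h_ij = entropy(Counter((r[i], r[j]) for r in rows).values(), n)
    mutual_info = h_i + h_j - h_ij
    return h_ij - mutual_info


def dependency_tree(rows):
    """Grow a spanning tree over the items with Prim's algorithm.

    At each step the closest item outside the tree is attached to the tree.
    Returns a list of (parent, child) edges, starting from item 0.
    """
    m = len(rows[0])
    in_tree = {0}
    edges = []
    while len(in_tree) < m:
        _, u, v = min(
            (pair_distance(rows, u, v), u, v)
            for u in in_tree
            for v in range(m)
            if v not in in_tree
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges


# Toy binary data: items 0 and 1 always co-occur, item 2 is independent of both.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
edges = dependency_tree(rows)
print(edges)
```

On this toy data, items 0 and 1 are at distance zero under the assumed measure, so they are linked first; the independent item 2 is attached last. Recomputing each pairwise distance inside the loop keeps the sketch short; a practical version would cache the distance matrix.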
