Using Tarjan's Red Rule for Fast Dependency Tree Construction

We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan's red-edge rule, which is generally considered a guaranteed recipe for bad performance.

[1]  Andrew W. Moore,et al.  Efficient Algorithms for Minimizing Cross Validation Error , 1994, ICML.

[2]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[3]  C. N. Liu,et al.  Approximating discrete probability distributions with dependence trees , 1968, IEEE Trans. Inf. Theory.

[4]  Raghu Ramakrishnan,et al.  Proceedings : KDD 2000 : the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2000, Boston, MA, USA , 2000 .

[5]  Geoff Hulten,et al.  Learning from Infinite Data in Finite Time , 2001, NIPS.

[6]  Nir Friedman,et al.  Learning Bayesian Network Structure from Massive Datasets: The "Sparse Candidate" Algorithm , 1999, UAI.

[7]  Robert E. Tarjan,et al.  Data structures and network algorithms , 1983, CBMS-NSF regional conference series in applied mathematics.

[8]  Marina Meila-Predoviciu,et al.  Learning with Mixtures of Trees , 1999 .

[9]  Marina Meila,et al.  An Accelerated Chow and Liu Algorithm: Fitting Tree Distributions to High-Dimensional Sparse Data , 1999, ICML.

[10]  Fazlollah M. Reza,et al.  Introduction to Information Theory , 2004, Lecture Notes in Electrical Engineering.

[11]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[12]  Nir Friedman,et al.  Bayesian Network Classification with Continuous Attributes: Getting the Best of Both Discretization and Parametric Fitting , 1998, ICML.

[13]  Geoff Hulten,et al.  A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering , 2001, ICML.

[14]  Andrew W. Moore,et al.  Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation , 1993, NIPS.