Improving access to organized information

We introduce several new models and methods for improving access to organized information. The first model, Constrained Subtree Selection (CSS), has applications in web site design and the reorganization of directory structures. Given a hierarchy represented as a rooted DAG G with n weighted leaves, one selects a subtree of the transitive closure of G that minimizes the expected path cost. Path cost is the sum of the degree costs along the path from the root to a leaf, where the degree cost γ is a function of the out-degree of a node. We give a sufficient condition on γ that makes CSS NP-complete; this result holds even when the leaves have equal weight. Turning to algorithms, we give a polynomial-time solution for instances of CSS in which G does not constrain the choice of subtrees and γ favors nodes with at most k links. Although CSS remains NP-hard for constant-degree DAGs, we give an O(log(k)γ(d+1))-approximation for any G with maximum degree d, provided that γ favors nodes with at most k links. Finally, we give a complete characterization of the optimal trees for two special cases: (1) linear degree cost in unconstrained graphs with uniform probability distributions, and (2) logarithmic degree cost in arbitrary DAGs with uniform probability distributions.

The second problem, Category Tree (CT), seeks a decision tree for categorical data in which internal nodes are categories, edges are appropriate values for those categories, and leaves are data items. CT generalizes the well-studied Decision Tree (DT) problem. Our results resolve two open problems: we give a (ln n + 1)-approximation for DT and show that DT has no polynomial-time approximation scheme unless P = NP. Beyond providing the first non-trivial upper and lower bounds on approximating DT, this work demonstrates that DT and a subtly different problem, which also bears the name "decision tree," have fundamentally different approximation complexity.

We complement these models with a new pruning method for k-nearest-neighbor queries on R-trees. We show that an extension to a popular depth-first 1-nearest-neighbor query yields a theoretically better search. We call this extension Promise-Pruning and construct a class of R-trees on which its application reduces the search space exponentially.
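To make the CSS objective concrete, the following is a minimal sketch (not taken from the thesis) of the expected path cost: each internal node contributes γ(out-degree) to every root-to-leaf path passing through it, and the tree's cost is the weight-averaged path cost over the leaves. The dictionary-based tree representation, node names, and the example degree cost γ(d) = d are illustrative assumptions.

```python
def expected_path_cost(tree, weights, gamma, root):
    """tree: dict mapping each internal node to its list of children;
    weights: dict mapping each leaf to its probability;
    gamma: degree-cost function of a node's out-degree."""
    total = 0.0

    def walk(node, cost_so_far):
        nonlocal total
        children = tree.get(node, [])
        if not children:                # leaf: accumulate weighted path cost
            total += weights[node] * cost_so_far
            return
        step = gamma(len(children))     # internal node: pay gamma(out-degree)
        for child in children:
            walk(child, cost_so_far + step)

    walk(root, 0.0)
    return total


# Example: a flat root with three leaves vs. a nested arrangement,
# under a linear degree cost gamma(d) = d and uniform leaf weights.
flat = {"r": ["a", "b", "c"]}
nested = {"r": ["x", "c"], "x": ["a", "b"]}
w = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
print(expected_path_cost(flat, w, lambda d: d, "r"))    # 3.0
print(expected_path_cost(nested, w, lambda d: d, "r"))  # (4 + 4 + 2) / 3 ≈ 3.33
```

Under this linear degree cost the flat tree is cheaper, illustrating how the choice of γ (and the subtrees G permits) determines which hierarchy minimizes the expected path cost.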
