Data structures for maintaining set partitions

Efficiently maintaining the partition induced by a set of features is an important problem in building decision-tree classifiers. To identify a small set of discriminating features, we must be able to add and remove specific features efficiently and determine how each change affects the induced classification or partition. In this paper we introduce a variety of randomized and deterministic data structures that support these operations on both general and geometrically induced set partitions. We give both Monte Carlo and Las Vegas data structures that realize near-optimal time bounds and are practical to implement. We then provide a faster solution to this problem in the geometric setting. Finally, we present a data structure that efficiently estimates the number of partitions separating elements.
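
To make the problem statement concrete, the following is a minimal sketch of one Monte Carlo style approach, not necessarily the construction used in the paper. It assumes each feature is given as the subset of elements that have it, and it assigns every active feature a random 64-bit key; an element's signature is the XOR of the keys of the active features containing it. Two elements in the same block always have equal signatures, while elements in different blocks collide only with probability about 2^-64 per pair. The class and method names (`FingerprintPartition`, `add_feature`, `remove_feature`, `same_block`, `num_blocks`) are hypothetical and chosen only for illustration.

```python
import random


class FingerprintPartition:
    """Illustrative sketch: partition of {0..n-1} induced by active features.

    Elements x and y are in the same block when they agree on every
    active feature. Equality of signatures is a one-sided (Monte Carlo)
    test: a false "same block" answer is possible but very unlikely.
    """

    def __init__(self, n, seed=None):
        self._rng = random.Random(seed)
        self.sig = [0] * n      # XOR fingerprint per element
        self.active = {}        # feature name -> (key, members)

    def add_feature(self, name, members):
        if name in self.active:
            raise ValueError(f"feature {name!r} is already active")
        key = self._rng.getrandbits(64)
        members = frozenset(members)
        for x in members:       # cost is proportional to the feature size
            self.sig[x] ^= key
        self.active[name] = (key, members)

    def remove_feature(self, name):
        key, members = self.active.pop(name)
        for x in members:       # XOR with the same key undoes the insertion
            self.sig[x] ^= key

    def same_block(self, x, y):
        return self.sig[x] == self.sig[y]

    def num_blocks(self):
        # Linear scan; a real structure would maintain this count incrementally.
        return len(set(self.sig))


p = FingerprintPartition(4, seed=1)
p.add_feature("a", {0, 1})      # blocks: {0, 1} and {2, 3}
p.add_feature("b", {1, 2})      # every element now has a distinct feature set
p.remove_feature("a")           # blocks: {1, 2} and {0, 3}
```

In this sketch an update costs time proportional to the size of the feature being toggled, and a Las Vegas variant would need an extra verification step before trusting that two equal signatures really mean the same block.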
