Algorithms for Finding Attribute Value Group for Binary Segmentation of Categorical Databases

We consider the problem of finding a set of attribute values that yields a high-quality binary segmentation of a database. The quality of a segmentation is measured by an objective function suited to the user's goal, such as mean squared error, mutual information, or χ², each of which is defined in terms of the distribution of a given target attribute. Our goal is to find value groups on a given conditional domain that split a database into two segments, optimizing the value of the objective function. Although the problem is intractable for general objective functions, feasible algorithms exist for finding high-quality binary segmentations when the objective function is convex, and we prove that the typical criteria mentioned above are all convex. We propose two practical algorithms, based on computational-geometry techniques, that find much better value groups than conventional heuristics.
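As a minimal illustration of the setting (not the paper's algorithms), the following sketch finds an optimal binary value group for one convex criterion, mean squared error with a numeric target, using the classical observation that for such criteria an optimal group is a prefix of the attribute values sorted by their within-value target mean. All names here are illustrative assumptions.

```python
# Illustrative sketch, not the paper's method: find the attribute-value group
# that minimizes total within-segment squared error ("mean squared error"
# criterion) for a numeric target. For convex criteria, it suffices to sort
# values by their target mean and scan prefixes.
from collections import defaultdict

def best_value_group(values, targets):
    """Return (value_group, error) for the best binary split of `values`."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])  # value -> [count, sum, sum_sq]
    for v, t in zip(values, targets):
        s = stats[v]
        s[0] += 1
        s[1] += t
        s[2] += t * t

    # Sort attribute values by their within-value target mean.
    order = sorted(stats, key=lambda v: stats[v][1] / stats[v][0])

    def sse(n, s, s2):
        # Sum of squared errors around the segment mean.
        return s2 - s * s / n if n else 0.0

    total_n = sum(s[0] for s in stats.values())
    total_s = sum(s[1] for s in stats.values())
    total_s2 = sum(s[2] for s in stats.values())

    best, best_group = float("inf"), None
    n = s = s2 = 0.0
    for i, v in enumerate(order[:-1]):  # proper, non-empty splits only
        c = stats[v]
        n += c[0]; s += c[1]; s2 += c[2]
        err = sse(n, s, s2) + sse(total_n - n, total_s - s, total_s2 - s2)
        if err < best:
            best, best_group = err, set(order[:i + 1])
    return best_group, best
```

This prefix scan is linear after sorting; for general convex objectives the paper's computational-geometry techniques play the analogous pruning role.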
