Feature minimization within decision trees

Decision trees for classification can be constructed using mathematical programming. Within decision tree algorithms, the feature minimization problem is to construct accurate decisions using as few features or attributes in each decision as possible. Feature minimization is an important aspect of data mining, since it helps identify which attributes are important and helps produce accurate and interpretable decision trees. In feature minimization with bounded accuracy, the number of features is minimized subject to a given misclassification-error tolerance. This problem can be formulated as a parametric bilinear program and is shown to be NP-complete. A parametric Frank-Wolfe method is used to solve the bilinear subproblems. The resulting feature minimization algorithm produces more compact, accurate, and interpretable trees. The procedure can be applied to many different error functions; formulations and results for two error functions are given. One method, FM RLP-P, dramatically reduced the number of features used on one dataset from 147 to 2 while maintaining an 83.6% testing accuracy. Computational results compare favorably with the standard univariate decision tree method, C4.5, as well as with linear-programming methods of tree construction.
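The paper's exact parametric bilinear formulation is not reproduced here, but the sketch below illustrates the general approach at a single tree node: minimize a weighted combination of average misclassification error and a concave feature-counting penalty e^T(1 - exp(-alpha*v)), solved by successive linear programs (the Frank-Wolfe subproblems over a polyhedron reduce to LPs). This follows the concave-penalty feature-selection variant of Bradley and Mangasarian rather than the paper's own formulation; all names, the penalty form, and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def feature_min_node(A, B, lam=0.5, alpha=5.0, max_iter=20, tol=1e-6):
    """Sketch of Frank-Wolfe-style feature-minimizing separation of point
    sets A and B at one decision node. Illustrative only; not the paper's
    exact FM formulation."""
    m1, n = A.shape
    m2 = B.shape[0]
    # Variable layout: x = [w (n), gamma (1), y (m1), z (m2), v (n)]
    # Separation: A w - gamma >= 1 - y,  B w - gamma <= -1 + z,  y, z >= 0
    # Feature bounds: |w_j| <= v_j, so v_j ~ 0 suppresses feature j.
    row1 = np.hstack([-A, np.ones((m1, 1)), -np.eye(m1),
                      np.zeros((m1, m2)), np.zeros((m1, n))])
    row2 = np.hstack([B, -np.ones((m2, 1)), np.zeros((m2, m1)),
                      -np.eye(m2), np.zeros((m2, n))])
    row3 = np.hstack([np.eye(n), np.zeros((n, 1 + m1 + m2)), -np.eye(n)])
    row4 = np.hstack([-np.eye(n), np.zeros((n, 1 + m1 + m2)), -np.eye(n)])
    A_ub = np.vstack([row1, row2, row3, row4])
    b_ub = np.hstack([-np.ones(m1 + m2), np.zeros(2 * n)])
    bounds = ([(None, None)] * (n + 1)          # w, gamma free
              + [(0, None)] * (m1 + m2 + n))    # y, z, v nonnegative
    v = np.zeros(n)
    for _ in range(max_iter):
        # Linearize the concave penalty lam * sum(1 - exp(-alpha * v))
        # around the current v; its gradient is lam * alpha * exp(-alpha*v).
        grad = lam * alpha * np.exp(-alpha * v)
        c = np.hstack([np.zeros(n + 1),
                       (1 - lam) / m1 * np.ones(m1),
                       (1 - lam) / m2 * np.ones(m2),
                       grad])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        if not res.success:
            raise RuntimeError("LP subproblem failed: " + res.message)
        v_new = res.x[-n:]
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    w, gamma = res.x[:n], res.x[n]
    return w, gamma, v
```

After convergence, features with v_j near zero (e.g., below 1e-4) are suppressed at that node, and the LP can be re-solved over the surviving features. The bounded-accuracy variant described above would instead constrain the error terms y and z to a given tolerance while minimizing the feature count alone.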
