Mathematical Programming for Data Mining: Formulations and Challenges

This article is intended to serve as an overview of a rapidly emerging research and applications area. In addition to providing a general overview and motivating the importance of data mining problems within the area of knowledge discovery in databases, our aim is to list some of the pressing research challenges and outline opportunities for contributions by the optimization research communities. Towards these goals, we include formulations of the basic categories of data mining methods as optimization problems. We also provide examples of successful mathematical programming approaches to some data mining problems.
