Data Mining Methods and Applications

In this chapter, we provide a review of the knowledge discovery process, including data handling, data mining methods and software, and current research activities. The introduction defines data mining (DM) and knowledge discovery in databases and provides general background on both. In particular, the potential for DM to improve manufacturing processes in industry is discussed. The second part of the chapter outlines the entire process of knowledge discovery in databases. The third part presents data handling issues, including databases and preparation of the data for analysis. Although modelers generally consider these issues uninteresting, data handling takes up the largest portion of the knowledge discovery process. It is also of great importance, since the resulting models can only be as good as the data on which they are based. The fourth part is the core of the chapter and describes popular data mining methods, divided into supervised and unsupervised learning. In supervised learning, the training data set includes observed output values (“correct answers”) for the given set of inputs. If the outputs are continuous/quantitative, then we have a regression problem. If the outputs are categorical/qualitative, then we have a classification problem. Supervised learning methods are described in the context of both regression and classification (as appropriate), beginning with the simplest case of linear models, then presenting more complex modeling with trees, neural networks, and support vector machines, and concluding with some methods, like nearest neighbor, that are only for classification. In unsupervised learning, the training data set does not contain output values. Unsupervised learning methods are described under two categories: association rules and clustering.
Association rules are appropriate for business applications where precise numerical data may not be available, while clustering methods are more technically similar to the supervised learning methods presented in this chapter. The fourth part closes with a review of various software options. The fifth part presents current research projects, involving both industrial and business applications. In the first project, data are collected from monitoring systems, and the objective is to detect unusual activity that may require action. For example, credit card companies monitor customers’ credit card usage in order to detect possible fraud. While methods from statistical process control were developed for similar purposes, the difference lies in the quantity of data. The second project describes data mining tools developed by Genichi Taguchi, who is well known for his industrial work on robust design. The third project tackles quality and productivity improvement in manufacturing industries. Although some detail is given, considerable research is still needed to develop a practical tool for today’s complex manufacturing processes. Finally, the last part provides a brief discussion of remaining problems and future trends.
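
The distinction drawn above between supervised learning (regression and classification) and unsupervised learning (here, association rules) can be illustrated with a minimal Python sketch. It is not taken from the chapter: the function names, the toy data, and the choice of simple least squares, 1-nearest-neighbor, and support/confidence are all assumptions made for illustration.

```python
from math import dist

# Supervised, continuous output (regression):
# least-squares fit of the line y = a + b*x.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

# Supervised, categorical output (classification):
# return the label of the training point closest to the query.
def nearest_label(points, labels, query):
    i = min(range(len(points)), key=lambda j: dist(points[j], query))
    return labels[i]

# Unsupervised (no output values): support and confidence
# of the association rule "a -> b" over a list of item sets.
def rule_stats(transactions, a, b):
    n_a = sum(1 for t in transactions if a in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support = n_ab / len(transactions)
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

# Hypothetical toy data, chosen only to exercise each function.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
label = nearest_label([(0, 0), (5, 5)], ["normal", "fraud"], (4, 4))
baskets = [{"bread", "milk"}, {"bread"}, {"milk"}, {"bread", "milk"}]
support, confidence = rule_stats(baskets, "bread", "milk")
```

In the first two functions the training data carry output values (`ys` and `labels`), whereas `rule_stats` works only from the observed item sets, which is the sense in which it is unsupervised.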
