Exploratory data analysis in the context of data mining and resampling.

Several misconceptions about exploratory data analysis (EDA) are widespread today. One is that EDA is opposed to statistical modeling. In fact, the essence of EDA is not to set aside all modeling and preconceptions; rather, researchers are urged not to begin the analysis with nothing but a strong preconception, so modeling remains legitimate in EDA. In addition, the nature of EDA has been changing with the emergence of new methods and the convergence of EDA with other methodologies, such as data mining and resampling. Conventional conceptual frameworks of EDA might therefore no longer be able to cope with this trend. In this article, EDA is introduced in the context of data mining and resampling, with an emphasis on three goals: cluster detection, variable selection, and pattern recognition. TwoStep clustering, classification trees, and neural networks, which are powerful techniques for accomplishing these goals, respectively, are illustrated with concrete examples.
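
To make the link between classification trees and variable selection concrete, the following is a minimal sketch, not the procedure used in the article. It assumes a Gini-impurity split criterion of the kind used in CART-style trees and ranks each candidate variable by the largest impurity reduction a single threshold split can achieve. The function names (`gini`, `best_split_gain`, `rank_variables`) and the toy data are invented for illustration only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_gain(values, labels):
    """Largest Gini reduction from one threshold split on a numeric variable."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Candidate thresholds: every distinct value except the largest,
    # so that both child nodes are non-empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity of the children, weighted by their share of the cases.
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def rank_variables(data, labels):
    """Sort variables by their best single-split gain, largest first."""
    gains = {name: best_split_gain(column, labels) for name, column in data.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Invented toy data (hypothetical, not from the article): one variable that
# separates the classes and one that carries no information about them.
data = {
    "hours_studied": [1, 2, 3, 4, 6, 7, 8, 9],
    "shoe_size": [7, 10, 8, 9, 10, 7, 9, 8],
}
labels = ["fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

ranking = rank_variables(data, labels)
for name, gain in ranking:
    print(f"{name}: gain = {gain:.3f}")
```

A full tree applies this search recursively within each child node, and resampling (for example, cross-validation) is typically used to check that the selected variables hold up on data not used for fitting.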
