Exploratory data analysis in the context of data mining and resampling.

Several misconceptions about exploratory data analysis (EDA) are widespread today. One is that EDA is opposed to statistical modeling. In fact, the essence of EDA is not to set aside all modeling and preconceptions; rather, researchers are urged not to begin the analysis with a strong preconception alone, so modeling remains legitimate in EDA. In addition, the nature of EDA has been changing with the emergence of new methods and its convergence with other methodologies, such as data mining and resampling. Conventional conceptual frameworks of EDA may therefore no longer be able to accommodate this trend. This article introduces EDA in the context of data mining and resampling, with an emphasis on three goals: cluster detection, variable selection, and pattern recognition. TwoStep clustering, classification trees, and neural networks, which are powerful techniques for accomplishing these three goals, respectively, are illustrated with concrete examples.
