Data Mining: Machine Learning and Statistical Techniques

Data Mining (DM) is an interdisciplinary field that arises from the confluence of statistics and machine learning (artificial intelligence). It provides technology for analyzing and understanding the information contained in a database, and it has been applied in a large number of fields. The term DM derives from the similarity between searching for valuable information in databases and mining valuable minerals in a mountain: the raw material is the data to be analyzed, and a set of learning algorithms act as diggers that search for valuable nuggets of information (Bigus, 1996). We offer an applied view of DM techniques in order to provide a didactic perspective on the data analysis process they involve. We analyze and compare the results of applying machine learning algorithms and statistical techniques, under DM methodology, to search for knowledge models that reveal the structures and regularities underlying the data analyzed.
Several definitions have been proposed. Some authors state that DM consists of “the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner” (Hand, Mannila & Smyth, 2001), or, more simply, “the search for valuable information in large volumes of data” (Weiss & Indurkhya, 1998), or “the discovery of interesting, unexpected or valuable structures in large databases” (Hand, 2007). Other authors define DM as “the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules” (Berry & Linoff, 2004). These definitions make it clear that DM is an appropriate process for detecting relationships and patterns in large databases, although it can also be applied to relatively small ones.
The concept of Knowledge Discovery in Databases (KDD) has frequently been used in the literature to describe this process (Han & Kamber, 2000, 2006; Hand et al., 2001). Under this view, DM is one stage of the process: it is preceded by a stage of data integration and collection (when starting from large raw databases) and by a stage of data cleaning and preparation (data pre-processing), and only then are descriptive or predictive models built in the DM stage, applying techniques suited to the analysis requirements. Other authors, however, use the term DM (instead of KDD) to refer to the complete process (Bigus, 1996; Two Crows, 1999; Paul, Guatam & Balint, 2002; Kantardzic, 2003; Ye, 2003; Larose, 2005).
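As an illustration only, the three KDD stages named above (integration, pre-processing, and the DM stage) can be sketched in a few lines of Python. The function names, the toy records, and the choice of a simple one-rule model (predict the most frequent target value for each feature value) are assumptions made here for illustration; they are not taken from the cited authors.

```python
from collections import Counter, defaultdict


def integrate(*sources):
    """Integration stage: merge records from several raw sources into one list."""
    return [dict(record) for source in sources for record in source]


def preprocess(records, required):
    """Pre-processing stage: drop incomplete records and normalise text values."""
    cleaned = []
    for record in records:
        # Skip records where a required field is missing or empty.
        if any(record.get(field) in (None, "") for field in required):
            continue
        cleaned.append(
            {k: v.strip().lower() if isinstance(v, str) else v
             for k, v in record.items()}
        )
    return cleaned


def mine_one_rule(records, feature, target):
    """DM stage: map each feature value to its most frequent target value."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[feature]][record[target]] += 1
    return {value: tally.most_common(1)[0][0] for value, tally in counts.items()}


# Hypothetical toy data from two separate sources (not real survey results).
source_a = [
    {"age_group": "Young ", "buys": "yes"},
    {"age_group": "young", "buys": "yes"},
    {"age_group": "senior", "buys": "no"},
]
source_b = [
    {"age_group": "Senior", "buys": "no"},
    {"age_group": "senior", "buys": None},  # incomplete record, removed in cleaning
    {"age_group": "young", "buys": "no"},
]

records = integrate(source_a, source_b)
cleaned = preprocess(records, required=["age_group", "buys"])
model = mine_one_rule(cleaned, feature="age_group", target="buys")
print(model)
```

The sketch keeps each stage as a separate function to mirror the distinction drawn above: authors who use the term KDD would reserve "DM" for `mine_one_rule` alone, whereas authors who use DM for the complete process would apply it to the whole pipeline.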

[1] Olivia Parr Rud, et al. Data Mining Cookbook: Modeling Data for Marketing, Risk, and Customer Relationship Management, 2000.

[2] Geoffrey E. Hinton, et al. Learning internal representations by error propagation, 1986.

[3] Paul Gray, et al. Introduction to Data Mining and Knowledge Discovery, 1998, Proceedings of the Thirty-First Hawaii International Conference on System Sciences.

[4] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques, 3rd Edition, 1999.

[5] Petra Perner, et al. Data Mining - Concepts and Techniques, 2002, Künstliche Intell.

[6] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[7] Michael J. A. Berry, et al. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 2004.

[8] Richard J. Cleary. Applied Data Mining: Statistical Methods for Business and Industry, 2006.

[9] Eric R. Ziegel, et al. Data Mining Cookbook, 2002, Technometrics.

[10] G. V. Kass. An Exploratory Technique for Investigating Large Quantities of Categorical Data, 1980.

[11] Nong Ye, et al. The Handbook of Data Mining, 2003.

[12] Joseph P. Bigus, et al. Data mining with neural networks: solving business problems from application development to decision support, 1996.

[13] Elena Gervilla García, et al. The methodology of Data Mining. An application to alcohol consumption in teenagers, 2009, Adicciones.

[14] D. Ruppert. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2004.

[15] John Elder, et al. Handbook of Statistical Analysis and Data Mining Applications, 2009.

[16] D. Edwards. Data Mining: Concepts, Models, Methods, and Algorithms, 2003.

[17] Linda Trocine, et al. Data Mining and Traditional Regression, 2003.

[18] Jiawei Han, et al. Data Mining: Concepts and Techniques, 2000.

[19] J. Ross Quinlan, et al. C4.5: Programs for Machine Learning, 1992.

[20] Magdalini Eirinaki. Data Mining for Business Intelligence, 2008.

[21] John Wang, et al. Encyclopedia of Data Warehousing and Mining, 2005.

[22] E. Gervilla García, et al. The methodology of Data Mining. An application to alcohol consumption in teenagers, 2009.

[23] Alan T. Schroeder. Data mining with neural networks: Solving business problems from application development to decision support, 1997.

[24] Philipp Slusallek, et al. Introduction to real-time ray tracing, 2005, SIGGRAPH Courses.

[25] A. K. Pujari, et al. Data Mining Techniques, 2006.

[26] David J. Spiegelhalter, et al. Machine Learning, Neural and Statistical Classification, 2009.

[27] Alfonso Palmer, et al. Numeric sensitivity analysis applied to feedforward neural networks, 2003, Neural Computing & Applications.

[28] Aijun An. Classification Methods, 2009, Encyclopedia of Data Warehousing and Mining.

[29] Pedro M. Domingos, et al. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, 1997, Machine Learning.

[30] Ian Witten, et al. Data Mining, 2000.

[31] W. Loh, et al. Split Selection Methods for Classification Trees, 1997.

[32] Heikki Mannila, et al. Principles of Data Mining, 2001, Undergraduate Topics in Computer Science.

[33] Sholom M. Weiss, et al. Predictive data mining - a practical guide, 1997.

[34] Mehmed Kantardzic, et al. Data Mining: Concepts, Models, Methods, and Algorithms, 2002.

[35] Thomas J. Watson, et al. An empirical study of the naive Bayes classifier, 2001.

[36] Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[37] Leo Breiman, et al. Classification and Regression Trees, 1984.

[38] Daniel T. Larose, et al. Data mining methods and models, 2006.

[39] D. Hand, et al. Idiot's Bayes: Not So Stupid After All?, 2001.

[40] J. Ross Quinlan, et al. Induction of Decision Trees, 1986, Machine Learning.

[41] Daniel T. Larose, et al. Discovering Knowledge in Data: An Introduction to Data Mining, 2005.