Confirmatory Factor Analysis for Applied Research

Data Mining Methods and Models is the second volume of a three-book series on data mining authored by Larose. This review considers the book on its own, independently of Larose’s other two volumes. According to the Preface, the goal of this book is to “explore the process of data mining from the point of view of model building.” Nevertheless, the reader will soon realize that the book is not intended to provide systematic or comprehensive coverage of data mining algorithms. Instead, it considers only supervised learning, or predictive modeling, and it walks the reader through the data mining process using a few selected modeling methods such as (generalized) linear modeling and the Bayesian approach. The book has seven chapters. Chapter 1 introduces dimension reduction, with a focus on techniques of the principal components analysis (PCA) type. Chapters 2, 3, and 4 provide detailed coverage of simple linear regression, multiple linear regression, and logistic regression, respectively. Chapter 5 introduces naive Bayes estimation and Bayesian networks. Chapter 6 discusses the basic idea of genetic algorithms. Finally, Chapter 7 presents a case study of modeling response to direct mail marketing within the CRISP (cross-industry standard process) framework. This book is very easy to read, a strength that many readers, especially those who are not statistically oriented, will greatly appreciate. Predictive modeling is perhaps the most technical part of the data mining process, and the author has done an excellent job of making this difficult topic accessible to a broad audience. For example, I like the way in which Bayesian networks are introduced in Chapter 5. After the reader works step by step through a churn example on naive Bayes estimation, Bayesian belief networks are easily understood as a natural extension. The overall style of the book is clear and patient.
The main limitation of the book is its narrow coverage. An inspired reader would expect a much longer list of topics. Hastie, Tibshirani, and Friedman (2001) give a fuller and more technical account of various data mining algorithms. The inclusion of genetic algorithms in Chapter 6 seems novel when compared to Hastie, Tibshirani, and Friedman (2001), but it is also a little unexpected as a separate chapter, since a genetic algorithm involves a stochastic search scheme, which is somewhat involved given the elementary nature of this text. Another noteworthy issue is that the author makes no attempt to distinguish between conventional statistical analysis and data mining. I found a few errors. On page 25, for example, it should be ai = 1 instead of ai = 1/4. Also, in the frame at the top of page 211, the label should perhaps read “Posterior Odds” instead of “Posterior Odds Ratio.” The book uses three different software packages to implement the ideas (SPSS with Clementine, Minitab, and WEKA), which some readers may find unappealing. On the other hand, this choice is justifiable, as it allows one to perform data mining at an affordable cost. In summary, I recommend this fairly readable book for adoption in a graduate-level introductory course on data mining, especially when the students come from varied backgrounds.