Data Mining Using SAS Applications

statistics majors who could use it to get a flavor of applications using R. Unfortunately, even this use may be a stretch, because many of the examples tend to be artificial or small “toy” datasets that seem out of place in today’s large data world. Nevertheless, it is a possibility. There are other problems with the text. The layout is dull and unimaginative. The paper is uncoated, so it feels cheap. Worst of all, for a book that has “Graphics” in its title, the quality of the graphs is poor. Many feel small and cramped to me (e.g., pp. 33, 159, 189, 262). There are numerous references to color, but all the graphs are in black, white, and gray scale! In several instances, this renders the example unintelligible. I also found several typos and what appear to be some coding errors. For instance, two of the symbols in the legend on page 283 are identical. But these are relatively minor distractions compared to the major problems discussed previously. Because of them, I believe this book should be avoided by all but those who already have a strong, extensive statistical background or instructors who will pick and choose various topics (most of the chapters are self-contained) for which they will provide the necessary motivation and development themselves. I am sorry to report such a negative assessment. As I said at the outset, I believe the authors’ intent was laudable; it was the execution that faltered. But this begs the question: Can there be a nonmathematical methods text that uses a suitable statistical software environment to present a broad range of modern statistical methods that practitioners should be using in their work? It is hard to argue that many of the topics chosen should not be in the armamentarium of data analysts, or that scientists should not at least be exposed to and aware of them.
If that is the case, then “a mile wide and an inch deep” is an inevitable consequence, and there seems to be little room to provide the insight and detail that I find critically lacking. I do not have a solution for this conundrum. It is certainly possible that more careful focus, elimination of some useful but ultimately extraneous topics (trees and time series would be two of my choices), and a clearer statement of and reliance on appropriate prerequisites might go a long way. But I fear not. As the authors themselves seem to understand and articulate in their “folksy” statistical advice, data analysis is partly science and partly art. Certainly one needs a firm grounding in the underlying formalism. How can one understand linear models without understanding contrasts? How can one agree or disagree with p values or post hoc tests without a solid grounding in the nuts and bolts of inference? How can one understand the trade-offs of model choice and prediction without understanding the bias/variance trade-off? But one also needs data analysis experience. As George Box famously said, “All models are wrong, but some are useful.” In my view, this means that one needs to comprehend empirical models as a framework for capturing and describing regularity in data (“signal” and “noise”). Sometimes the structure of the framework is important (which variables are important, which are not), sometimes only the predictions that one makes from it are important, and sometimes both are important. But to do this successfully, one must do more than turn the crank on a sausage grinder. Such issues as “aberrant” data, complex covariance structures and clustering (which data are relevant), and “overfitting” cannot be neatly packaged in a statistical menu and served up. It requires understanding and experience to determine how (or even whether) the study design and data can serve the scientific needs: to separate wheat from chaff in the issues at hand.
I think many of the authors’ remarks may be nearly incomprehensible without such experience. If this sounds more like the philosophy of science than statistics, then so be it. I would like to see more “philosophical” discussion of the relationship of empirical models to deterministic models in any statistical text aimed at scientists. Most scientists I know view empirical modeling as at best a semi-legitimate cousin of “real science.” In their view, statistics is primarily a means of establishing a p value or some other stamp of authenticity and not an essential component of “the scientific method.” If scientists do not believe or understand how careful statistical data analysis can inform and enrich their work, how can they learn and effectively apply statistical methodology? The authors appear to understand and allude to these matters at several points in the book (obviously so, in their comments about p values), but I think these matters deserve more careful consideration. To give just one example, the authors mention the need for parametric “biased” models for small datasets versus more data-driven, flexible models for larger datasets (but what are “small” and “large”?) in their introduction to regression trees on page 260. As they stand, I think these remarks will be incomprehensible to most scientists. But this issue is clearly central to the matter at hand: why should one use regression trees, nonparametric curve smoothers, or the many other modern tools that R makes available in the first place? This would have been a natural place for the authors to clarify these issues in an extended discussion. I regret that they did not. As I stated earlier, it is not at all clear to me how any book could cover so much ground without demanding much more statistical background and data analysis experience from its readers. Even worse, I think there are more important topics that should have been added!
Chief among these is experimental design, about which the authors have almost nothing to say. I believe that a book on statistical methods targeted at scientists must say something about the crucial role of study design in determining exactly what one can subsequently learn from data analysis. Unfortunately, that is not all I can think of; robust methods, nonlinear models, censoring, and missing data are some other topics that deserved (more) space. My experience has been that practicing scientists are often confronted by these issues without recognizing their statistical consequences and consequently use ad hoc analyses—for example, treating censored values as complete data, subjectively “masking” outliers—that can considerably distort their data analyses. Science pays a price for such poor data analysis. To summarize, this book takes a risk and, in my opinion, fails. However, I believe that the authors’ intent was right, and that we can learn much from the failure. I hope that other prospective authors will consider the important issues that have been raised and make their own efforts in this direction. Not only statistical education, but also the larger issue of what role statistics and statisticians should play in science and technology, are of concern here. We need to do better.