Engineering Statistics
over the individual SAS procedure for that algorithm, and a how-to for running the SAS macro). Case studies are then given for further illustration. Chapter 3 looks at exploratory data analysis. Simple descriptive statistics (a variety of measures of location, dispersion, and deviation from the normal distribution, namely skewness and kurtosis) and graphical techniques (frequency histograms, boxplots, and Q–Q plots) are described for continuous variables. Likewise, simple descriptive statistics (cross-tabulations) and graphical techniques (bar charts and pie charts) are given for categorical variables. Macros that produce descriptive statistics and graphs and have additional features, such as the construction of new datasets that exclude outliers, are then introduced. Unsupervised learning is covered in Chapter 4. The algorithms discussed here (principal component analysis, exploratory factor analysis, and disjoint cluster analysis) are statistical in nature with certain distributional assumptions. The basic concepts and terminology, as well as such topics as methods for estimating the number of principal components and the optimum number of clusters, are mentioned. The macros described in this chapter implement these algorithms and also produce scatterplot matrices, test for multivariate skewness and kurtosis, and detect deviation from multivariate normality (Q–Q plots) and outliers. Next, prediction and classification (supervised learning) are discussed in Chapters 5 and 6. Multiple linear regression and binary logistic regression are the prediction algorithms treated. 
Following brief nontechnical descriptions and definitions of basic concepts and terminology in multiple linear and binary logistic regression modeling, the author briefly describes model-building steps, including graphical exploratory data analysis (to help understand the relationships between the response variable and potential predictor variables), model selection, and checking for violations of model assumptions (e.g., nonnormality of residuals for multiple linear regression). SAS macros that incorporate these steps are then presented and illustrated by case studies. Macros for producing lift charts (there is no procedure or option in SAS to produce these automatically) and scoring new datasets using the derived models are also given. Classification methods (discriminant analysis and classification trees based on the CHAID method) are treated in a similar fashion in Chapter 6. The book ends with a brief discussion of databases (data warehousing) and some algorithms of nonstatistical origin (artificial neural networks and market basket analysis).
First, looking at the book itself, we find that it is essentially devoted to providing instructions on using a set of SAS macros for performing data mining. It is not self-contained and does not attempt to give complete descriptions of or insight into the data mining algorithms. With the aid of case studies, it generally does a good job of instructing the reader on how to use the macros that accompany it. However, an explicit description of the structure of the software (macro-call SAS programs set up the menus and call the macros) would make it easier for the reader to get the macros up and running. For example, each macro-call program must contain the correct location of the related macro for the macro to run. This typically requires that the user change the location given in the macro-call program to reflect the user’s system. One more comment: Using fewer acronyms would make the text easier to follow.
Turning to the macros themselves, we find that they provide a degree of user friendliness because they use menus to input parameters. However, they do not permit the user to browse for file locations and do not automatically populate the various macro parameters with the possible candidates when a data file is specified. By packaging supplemental procedures with the main procedure, the macros provide convenience and also serve to guide the analysis to include the requisite diagnostic tests. On the other hand, the set of macros does not contain a wide range of algorithms (e.g., no robust regression and no neural nets); the ones given are often statistical in nature with distributional assumptions.
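
For readers unfamiliar with the lift charts mentioned in connection with the regression chapter, the following Python sketch shows one common way to tabulate lift by score bin. It is not taken from the book or its SAS macros. The function name `lift_table`, its parameters, and the toy data are assumptions made purely for illustration, and other definitions of lift are in use.

```python
from typing import List, Sequence, Tuple


def lift_table(
    scores: Sequence[float],
    outcomes: Sequence[int],
    n_bins: int = 10,
) -> List[Tuple[int, float, float]]:
    """Tabulate lift for a binary classifier (hypothetical helper).

    Cases are ranked by predicted score, highest first, and cut into
    n_bins groups of nearly equal size. Each returned row holds
    (bin number, response rate within the bin, cumulative lift), where
    cumulative lift is the response rate among all cases up to and
    including that bin divided by the overall response rate.
    """
    if len(scores) != len(outcomes) or len(scores) == 0:
        raise ValueError("scores and outcomes must be non-empty and equal in length")
    overall_rate = sum(outcomes) / len(outcomes)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no positive outcomes")

    # Rank the observed outcomes by descending predicted score.
    ranked = [y for _, y in sorted(zip(scores, outcomes),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)

    rows: List[Tuple[int, float, float]] = []
    cumulative_positives = 0
    for b in range(n_bins):
        start = b * n // n_bins
        stop = (b + 1) * n // n_bins
        chunk = ranked[start:stop]
        if not chunk:  # more bins than cases
            continue
        cumulative_positives += sum(chunk)
        bin_rate = sum(chunk) / len(chunk)
        cumulative_lift = (cumulative_positives / stop) / overall_rate
        rows.append((b + 1, bin_rate, cumulative_lift))
    return rows


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    toy_outcomes = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    for bin_no, rate, lift in lift_table(toy_scores, toy_outcomes, n_bins=5):
        print(f"bin {bin_no}: response rate {rate:.2f}, cumulative lift {lift:.2f}")
```

Plotting cumulative lift against bin number gives the chart. A model with no predictive value stays near 1 throughout, and the cumulative lift in the final bin is always 1 by construction.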