Regression Analysis by Example

This book serves well as an introduction to the specific area of methods for detecting and correcting model violations in the standard linear regression model. In the preface, the authors state that they view regression analysis as a set of data-analytic techniques that examine the interrelationships among a given set of variables. They approach the topic from an informal analysis point of view directed at uncovering patterns in the data rather than from the formal statistical-tests-and-probabilities point of view. The book relies heavily on graphical methods and intuitive explanations to achieve this. Several examples are introduced early in the book and are drawn on throughout the later chapters to demonstrate the different methods discussed. The examples come more from sociological and economic areas than from engineering fields; however, they do demonstrate the given techniques well. There are no mathematical derivations for any of the results, although references are given throughout. The authors present the different subjects at a sufficient level of detail so that most standard regression packages can be used for the methods discussed. The foundations of regression analysis are summarized without much detail, so the reader needs to be knowledgeable about and comfortable with multiple regression and model building to get the most benefit from the material. The chapter layout is as follows:
Chapter 1: Introduction. All of the datasets used in the examples throughout the book are available from the Web. The authors introduce a set of steps, cyclical in nature, that they use for a given regression analysis problem. They follow this process closely throughout the book.
Chapter 2: Simple Linear Regression. This chapter outlines, in very general terms, the distribution theory, confidence intervals, and hypothesis tests for the simple regression model. Equations and expressions are given with no derivations.
There are typographical errors in the equations and table numbers throughout the chapter.
Chapter 3: Multiple Linear Regression. This chapter introduces the multiple linear regression model, again in very general terms. It contains one of the better discussions and interpretations of partial regression coefficients that I have read and includes a very effective example to emphasize the point. The chapter also has a good general introduction to the model comparison approach in regression analysis, discussing coefficient testing based on full and reduced models. It also introduces the idea of model constraints and how to test them using a model comparison approach. The details of applying these methods are given in later chapters with some specific examples. An appendix at the end of this chapter gives details of multiple regression in matrix notation in terms of the estimators and residuals and their properties.
Chapter 4: Regression Diagnostics: Detection of Model Violations. This chapter addresses the validation of assumptions and the detection and correction of model violations. The authors discuss both standard techniques and more recently developed techniques to address nonnormality, outliers, high leverage points, and influential observations. The applications of these techniques are demonstrated repeatedly in the examples discussed in later chapters. Each of the remaining chapters in the book deals with a specific type of regression problem or situation.
Chapter 5: Qualitative Variables as Predictors. The authors do an excellent job of explaining how model parameterization works with qualitative variables. They also do an effective job of introducing the methods of analysis of covariance (ANCOVA) in terms of generating and comparing models with the same or different slopes and intercepts. However, the material does not address the idea of degrees of freedom in these situations. This is one of the few drawbacks I see in the material.
A good understanding of degrees of freedom in ANCOVA models is essential in specifying particular tests and evaluating model performance.
Chapter 6: Transformation of Variables. This chapter provides a general overview of transformations of variables and focuses on three traditional situations where transformations can be applied: (1) to achieve linearity of the model, (2) to achieve normality of the errors, and (3) to stabilize the variance.
Chapter 7: Weighted Least Squares. This chapter addresses violations of the homogeneity-of-variance assumption. The material has a good intuitive explanation of two specific situations where weighted least squares is equivalent to ordinary least squares applied to transformed variables: (1) when the variance of the errors is a function of one of the predictor variables and (2) when the response values are means based on different sample sizes.
Chapter 8: The Problem of Correlated Errors. This chapter addresses the independent-errors assumption and the techniques used to identify and correct violations of it. It gives a good general overview of the Durbin–Watson statistic and some of its limitations, transformations to remove autocorrelation, and iterative estimation with autocorrelated errors.
Chapter 9: Analysis of Collinear Data. Chapter 10: Biased Estimation of Regression Coefficients. Chapters 9 and 10 both present methods for the detection and correction of the collinearity problem. This is one of the best discussions in the book. Different techniques are used to identify whether collinearity exists, and different methods are used to correct the situation. The authors address three specific questions: (1) How does multicollinearity affect inference and forecasting? (2) How can it be detected? (3) What can be done to resolve the difficulties associated with it? Chapter 9 contains a brief appendix, in matrix notation, on using principal components to detect multicollinearity. Chapter 10 contains a brief appendix on ridge regression in matrix notation.
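As a rough illustration of the kind of detection and correction tools discussed in Chapters 9 and 10, the following Python sketch computes variance inflation factors and a condition number on synthetic data and then compares least squares with a ridge estimate. It is not taken from the book: the data, the function names (vif, condition_number, ridge), and the ridge constant k = 5 are assumptions made purely for this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

def condition_number(X):
    """Square root of the ratio of the largest to the smallest eigenvalue
    of the predictor correlation matrix."""
    eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return np.sqrt(eig.max() / eig.min())

def ridge(X, y, k):
    """Ridge coefficients for standardized predictors and a centered response.
    Setting k = 0 reproduces the ordinary least squares solution."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ yc)

# Synthetic data (hypothetical): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 2.0 * x2 + 1.0 * x3 + rng.normal(size=n)

vifs = vif(X)
kappa = condition_number(X)
beta_ols = ridge(X, y, 0.0)    # k = 0: ordinary least squares
beta_ridge = ridge(X, y, 5.0)  # k = 5: an arbitrary illustrative choice

print("VIFs:", vifs)
print("condition number:", kappa)
print("OLS coefficients (standardized):", beta_ols)
print("ridge coefficients (standardized):", beta_ridge)
```

On data like these, the two nearly collinear predictors should show large variance inflation factors while the unrelated one stays close to 1, and the ridge coefficients have a smaller norm than the least squares ones. How to choose k in practice is a separate question that this sketch does not address.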
Chapter 11: Variable Selection Procedures. This chapter starts out by reviewing the standard methodology behind forward and backward selection and introduces different criteria useful for comparing results across different models. It contains two good examples using the methods introduced in Chapters 9 and 10 (principal components and ridge regression) as a means of evaluating situations containing a large number of predictor variables.
Chapter 12: Logistic Regression. This chapter contains a good overview of logistic regression. It is one of the more mathematical chapters in the book; however, the authors present the material in a very reader-friendly manner. One example, using financial data, goes into the details of diagnostic measures for the model, judging the fit of the model, and the model comparison approach using the chi-squared statistic.
The authors state that the primary focus of the book is on the detection and correction of violations of the basic linear model assumptions as a means of achieving a thorough and informative analysis of the data. The book covers only univariate regression, both simple and multiple, linear, and, to some extent, nonlinear (under linearizable conditions). The authors deal mainly with the least squares method of estimation and, to some extent, weighted least squares. They touch on some aspects of other estimation methods, such as maximum likelihood. Many of the examples in the second half of the book use ridge regression and principal components repeatedly. The authors do a very thorough and effective job of demonstrating the variety of methods that are available to help the analyst in different situations. This book is not a stand-alone regression text, and I do not believe it was intended to be.
Overall, the material that is covered is an excellent introduction to a substantial collection of diagnostic tools that aid in uncovering hidden structures in one’s data. I would recommend the book as an addition to any applied statistician’s library.