Robust-Diagnostic Regression: A Prelude for Inducing Reliable Knowledge from Regression

Regression lies at the heart of statistics; it is one of the most important branches of multivariate techniques available for extracting knowledge in almost every field of study and research. It has recently drawn great interest in fields such as machine learning, pattern recognition, and data mining. Investigating outliers (exceptional observations) has been a problem for data analysts and researchers for a century. Blind use of data can have dangerous consequences, leading to the discovery of meaningless patterns and to imperfect knowledge. As a result of the digital revolution and the growth of the Internet and intranets, data continue to accumulate at an exponential rate, so detecting outliers and studying their costs and benefits as a tool for reliable knowledge discovery deserves close attention. Over the last few decades, the investigation of outliers in regression has received considerable attention within two schools of thought: robust regression and regression diagnostics. Robust regression first fits a regression to the majority of the data and then identifies outliers as the points with large residuals from the robust fit, whereas regression diagnostics first finds the outliers, deletes or corrects them, and then fits the remaining regular data by classical (usual) methods. Initially there was much confusion, but researchers have now reached a consensus that robustness and diagnostics are two complementary approaches to the analysis of data and that neither is sufficient on its own. In this chapter, we discuss both of them under the single umbrella of regression diagnostics. The chapter sets out the necessity and views of regression diagnostics and presents several contemporary methods through numerical examples in linear regression within each of these categories, together with current challenges and
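
The contrast between the two workflows described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the chapter's own procedure: the robust step uses a Huber M-estimator fitted by iteratively reweighted least squares (which resists vertical outliers but is not a high-breakdown method), the diagnostic step uses externally studentized (deletion) residuals, and the function names, the cutoff of 3, and the toy data with one planted outlier are all hypothetical choices made for the example.

```python
import numpy as np

def ols(X, y):
    """Classical least-squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def robust_then_flag(X, y, c=1.345, cutoff=3.0, n_iter=50):
    """Robust route: fit to the bulk of the data first, then flag large residuals.
    Assumption: Huber weights with a MAD-type scale, fitted by IRLS."""
    beta = ols(X, y)
    for _ in range(n_iter):
        r = y - X @ beta
        scale = max(np.median(np.abs(r)) / 0.6745, 1e-12)
        u = np.abs(r) / scale
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))  # Huber weights
        sw = np.sqrt(w)
        beta = ols(X * sw[:, None], y * sw)  # weighted least squares step
    r = y - X @ beta
    scale = max(np.median(np.abs(r)) / 0.6745, 1e-12)
    flagged = np.flatnonzero(np.abs(r) / scale > cutoff)
    return beta, flagged

def flag_then_fit(X, y, cutoff=3.0):
    """Diagnostic route: flag outliers first, delete them, then refit by OLS.
    Assumption: flagging uses externally studentized residuals."""
    n, p = X.shape
    e = y - X @ ols(X, y)
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)                      # leverages (hat-matrix diagonal)
    s2_del = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)  # deletion variance estimates
    t = e / np.sqrt(s2_del * (1.0 - h))
    flagged = np.flatnonzero(np.abs(t) > cutoff)
    keep = np.setdiff1d(np.arange(n), flagged)
    return ols(X[keep], y[keep]), flagged

# Hypothetical toy data: a straight line with mild deterministic noise
# and a single planted outlier at index 10.
n = 20
x = np.arange(n, dtype=float)
y = 2.0 + 3.0 * x + 0.3 * np.sin(x)
y[10] += 30.0
X = np.column_stack([np.ones(n), x])

beta_robust, out_robust = robust_then_flag(X, y)
beta_diag, out_diag = flag_then_fit(X, y)
print("robust fit:", beta_robust, "flagged:", out_robust)
print("diagnostic fit:", beta_diag, "flagged:", out_diag)
```

With a single isolated outlier the two routes should agree. The case for treating them as complementary arises mainly with multiple outliers or high-leverage points, where single-case deletion diagnostics can suffer from masking and an M-estimator of this kind can break down.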
