Outlier Detection by Regression Diagnostics in Large Data

Regression analysis is a well-known supervised learning technique. To estimate and justify an effective regression model, the data set must first be checked and preprocessed. Real data are rarely free of outliers (noise). In areas such as bioinformatics, astronomy, image analysis, and computer vision, large or fat data with unusual observations (outliers) arise naturally. In these fields, robust regression is commonly used in the model-building process, but robust regression methods do not perform well enough on large and/or high-dimensional data. Checking raw data for outliers in a regression setting is known as regression diagnostics. Robust regression and regression diagnostics are complementary approaches, and neither alone is sufficient for studying contaminated data. Most popular diagnostic methods are inadequate for large data because of masking and swamping. In this article, both approaches are briefly discussed, and we show that a new measure can effectively identify outliers (influential observations) in linear regression for large data.
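As an illustration of what regression diagnostics computes, the following is a minimal sketch of three classical single-case quantities for ordinary least squares: leverage, internally studentized residuals, and Cook's distance. It is not the new measure proposed in the article. The function name `regression_diagnostics`, the synthetic data, and the planted outlier are assumptions made only for this example.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Classical OLS diagnostics: leverage, studentized residuals, Cook's distance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    p = A.shape[1]
    Q, R = np.linalg.qr(A)                    # thin QR; avoids forming the n x n hat matrix
    leverage = np.sum(Q**2, axis=1)           # diagonal of the hat matrix
    beta = np.linalg.solve(R, Q.T @ y)        # least-squares coefficients
    resid = y - A @ beta
    s2 = resid @ resid / (n - p)              # residual variance estimate
    student = resid / np.sqrt(s2 * (1.0 - leverage))
    cooks = student**2 * leverage / (p * (1.0 - leverage))
    return leverage, student, cooks

# Synthetic data (illustrative only): a clean linear trend plus one planted outlier.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, 50)
x = np.append(x, 25.0)                        # far from the other x values
y = np.append(y, 0.0)                         # far from the fitted trend

lev, stud, cook = regression_diagnostics(x, y)
suspect = int(np.argmax(cook))                # index of the most influential case
```

Single-case deletion measures such as these work when one unusual observation stands alone. With several outliers in a group they can fail through masking (outliers hide one another) or swamping (clean points are flagged), which is the limitation the abstract refers to.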
