Regression diagnostics in large and high dimensional data

"Learning methods" play a key role in statistics, data mining, and artificial intelligence, and intersect with engineering and other disciplines. These methods for analyzing and modeling data come in two flavors: supervised and unsupervised learning. Regression analysis and classification are two well-known supervised learning techniques. To obtain an effective regression model, the data set must first be checked and preprocessed, especially when it is large and high-dimensional, as is common in astronomy, bioinformatics, image analysis, computer vision, and similar fields. In these fields, large or fat data sets naturally contain unusual observations (outliers). Checking raw data for outliers in regression is known as regression diagnostics. Most of the popular diagnostic methods do not perform well enough on large and high-dimensional data. The aim of this paper is to provide a new measure for identifying influential observations in linear regression for large, high-dimensional data.
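To make the notion of an influential observation concrete, the sketch below computes two classical single-case diagnostics, leverage and Cook's distance, for an ordinary least squares fit. It is only an illustration of the kind of diagnostic the abstract refers to, not the new measure proposed in the paper; the function name `cooks_distance`, the synthetic data, and the planted outlier are all assumptions made for this example.

```python
import numpy as np

def cooks_distance(X, y):
    """Return (leverage, Cook's distance) for an OLS fit with an intercept."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])      # design matrix with intercept
    p = Xd.shape[1]
    # Leverage = diagonal of the hat matrix, obtained from a thin QR
    # factorisation so the full n x n matrix is never formed.
    Q, _ = np.linalg.qr(Xd)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - p)               # residual variance estimate
    d = resid**2 / (p * s2) * h / (1.0 - h)**2
    return h, d

# Synthetic data (assumed for illustration) with one planted influential point.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[0] = [8.0, 8.0, 8.0]                         # far from the bulk of X
y[0] = -30.0                                   # and far from the true response
h, d = cooks_distance(X, y)
flagged = int(np.argmax(d))                    # index with the largest influence
```

Measures of this kind delete or down-weight one case at a time, so they can miss groups of outliers that mask each other, which is one reason such diagnostics may struggle on large, high-dimensional data.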
