On the robustness and generalization of Cauchy regression

It was recently highlighted in a special issue of Nature [1] that the value of big data has yet to be effectively exploited for innovation, competition, and productivity. To realize the full potential of big data, big learning algorithms need to be developed that keep pace with the continuous creation, storage, and sharing of data. Least squares (LS) and least absolute deviation (LAD) regression have served business, government, and society well over the past few decades. However, both are severely limited by noisy data because their breakdown points are zero, i.e., they do not tolerate outliers. By appropriately setting the tuning constant of Cauchy regression (CR), the maximum possible breakdown point of 50% can be attained, so CR is capable of learning a robust model from noisy big data. Although the breakdown point of CR has been analyzed comprehensively in the literature, we propose a new perspective that interprets the optimization of the objective function as a sample-weighted procedure, which makes the differences in robustness among LS, LAD, and CR explicit. We also study the statistical performance of CR: we derive generalization error bounds by analyzing the covering number and Rademacher complexity of the hypothesis class, and we show how the scale parameter affects performance.
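
To make the sample-weighted interpretation concrete, the sketch below (illustrative Python, not the paper's implementation) fits CR by iteratively reweighted least squares: at each step every sample receives the weight w(r) = 1 / (1 + (r/c)^2) implied by the Cauchy loss, whereas LS keeps w = 1 and LAD corresponds to w = 1/|r|. The tuning constant c = 2.385, the linear model, and the synthetic data with gross outliers are assumptions made for illustration only.

```python
# Minimal sketch of Cauchy regression viewed as a sample-weighted procedure (IRLS).
# Assumptions for illustration: linear model, tuning constant c = 2.385,
# synthetic data containing 10% gross outliers.
import numpy as np

def cauchy_irls(X, y, c=2.385, n_iter=100, tol=1e-8):
    """Fit linear Cauchy regression by iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]         # LS start: every sample weighted 1
    for _ in range(n_iter):
        r = y - X @ beta
        # Weight implied by the Cauchy loss: w(r) = 1 / (1 + (r/c)^2).
        # LS keeps w = 1; LAD uses w = 1/|r|; Cauchy smoothly down-weights large residuals.
        w = 1.0 / (1.0 + (r / c) ** 2)
        Xw = X * w[:, None]                              # W X
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)   # solve (X^T W X) beta = X^T W y
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)
    y[:20] += 50.0                                       # 10% gross outliers
    print("LS estimate:    ", np.linalg.lstsq(X, y, rcond=None)[0])  # pulled toward outliers
    print("Cauchy estimate:", cauchy_irls(X, y))                      # close to [1, 2]
```

Because the Cauchy weight decreases as the residual grows, an arbitrarily corrupted sample contributes only a bounded (in fact vanishing) amount to the weighted normal equations, which is the intuition behind the high breakdown point discussed above; under LS the unweighted residual of such a sample dominates the fit.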

[1] W. Hoeffding, Probability Inequalities for Sums of Bounded Random Variables, 1963.

[2] H. R. Moore, Robust regression using maximum-likelihood weighting and assuming Cauchy-distributed random error, 1977.

[3] C. McDiarmid et al., On the method of bounded differences, Surveys in Combinatorics, 1989.

[4] H. S. Seung et al., Learning the parts of objects by non-negative matrix factorization, Nature, 1999.

[5] P. L. Bartlett et al., Neural Network Learning: Theoretical Foundations, 1999.

[6] V. N. Vapnik, The Nature of Statistical Learning Theory, Statistics for Engineering and Information Science, 2000.

[7] P. L. Bartlett et al., Rademacher and Gaussian Complexities: Risk Bounds and Structural Results, J. Mach. Learn. Res., 2003.

[8] T. Zhang, Covering Number Bounds of Certain Regularized Linear Function Classes, J. Mach. Learn. Res., 2002.

[9] S. Mendelson, A Few Notes on Statistical Learning Theory, Machine Learning Summer School, 2002.

[10] C. Müller, Breakdown points of Cauchy regression-scale estimators, 2002.

[11] B. Ripley et al., Robust Statistics, Encyclopedia of Mathematical Geosciences, 2018.

[12] A. Christmann et al., Bouligand Derivatives and Robustness of Support Vector Machines for Regression, J. Mach. Learn. Res., 2007.

[13] R. Rosenfeld, Nature, Otolaryngology--Head and Neck Surgery, 2009.

[14] S. Mannor et al., Robustness and generalization, Machine Learning, 2010.

[15] J. Manyika, Big data: The next frontier for innovation, competition, and productivity, 2011.

[16] S. P. Boyd et al., Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Found. Trends Mach. Learn., 2011.

[17] J. Shawe-Taylor et al., MahNMF: Manhattan Non-negative Matrix Factorization, arXiv, 2012.

[18] F. Pereira et al., Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia, 2012.

[19] T. Kraska et al., MLI: An API for Distributed Machine Learning, IEEE 13th International Conference on Data Mining, 2013.