FSDA: A MATLAB toolbox for robust analysis and interactive data exploration

Abstract. We present the FSDA (Forward Search for Data Analysis) toolbox, a new software library that extends MATLAB and its Statistics Toolbox to support robust and efficient analysis of complex datasets affected by different sources of heterogeneity. As its name indicates, the project originated with the Forward Search approach, but it has evolved to include the main traditional robust multivariate and regression techniques, including LMS, LTS, MCD, MVE, MM and S estimation. For problems where the data deviate from typical model assumptions, tools are available for robust data transformation and robust model selection. When different views of the data are available, e.g. a scatterplot of units and a plot of the distances of those units from a fitted model, FSDA links the views and lets the user interact with them: for example, objects selected in one plot are highlighted in the other plots. This considerably simplifies exploring the data to extract information and detect patterns. We show the potential of FSDA in chemometrics using data from chemical and pharmaceutical problems, where outliers, multiple groups, deviations from normality and other complex structures are not uncommon.
