Robust archetypoids for anomaly detection in big functional data

Archetypoid analysis (ADA) has proven to be a successful unsupervised statistical technique for identifying extreme observations on the periphery of the data cloud, in both classical multivariate data and functional data. However, two questions remain open in this field: the use of ADA for outlier detection, and its scalability. We propose using robust functional archetypoids together with the adjusted boxplot to pinpoint functional outliers. Furthermore, we present a new archetypoid algorithm that obtains results from large data sets in reasonable time. Functional time series arise in many practical problems, so this paper focuses on functional data settings. The new algorithm for detecting functional anomalies, called CRO-FADALARA, can be used with both univariate and multivariate curves. Our proposal for outlier detection is compared with state-of-the-art methods in a controlled study, where it performs well. CRO-FADALARA is also applied to two large time series data sets, for which the outlier curves are discussed and the reduction in computational time is reported. A third case study with a small ECG data set is discussed, given the importance of such data in functional data scenarios. All data, R code and a new R package are freely available.
