Distance for Functional Data Clustering Based on Smoothing Parameter Commutation

We propose a novel method for determining the dissimilarity between subjects in functional data clustering. Spline smoothing or interpolation is commonly used to handle data of this type. Rather than fixing a single best-representing curve estimate for each subject throughout clustering, we measure the dissimilarity between two subjects using curve estimates that vary as the smoothing parameters of the two subjects are commuted, pair by pair. The intuition is twofold: the smoothing parameter of a smoothing spline reflects the inverse signal-to-noise ratio, and when an identical smoothing parameter is applied to two similar subjects, their smoothed curves are expected to be close. Simulations comparing the proposal with other dissimilarity measures show its effectiveness. The proposal also has several practical advantages. First, missing values and irregular time points can be handled directly, thanks to the nature of smoothing splines. Second, conventional dissimilarity-based clustering methods can be applied straightforwardly, and the dissimilarity also serves as a useful tool for outlier detection. Third, implementation is simple, since subroutines for smoothing splines and numerical integration are widely available. Fourth, the computational complexity does not increase: it is of the same order as that of calculating the Euclidean distance between curves estimated by smoothing splines.
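The commutation idea can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices in it are assumptions made only for illustration: a discrete second-difference (Whittaker-type) smoother stands in for a true smoothing spline, all subjects are observed on one common, equally spaced grid, each subject's smoothing parameter is chosen by generalized cross-validation over a fixed candidate grid, and the two integrated squared differences are combined by summing and taking a square root. The function names `hat_matrix`, `gcv_lambda`, and `commutation_dissimilarity` are hypothetical.

```python
import numpy as np

def hat_matrix(n, lam):
    """Hat matrix of a discrete roughness-penalty smoother on n equally
    spaced points (a stand-in for a smoothing spline; an assumption)."""
    D = np.diff(np.eye(n), n=2, axis=0)  # (n-2) x n second-difference operator
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

def gcv_lambda(y, lambdas):
    """Return the candidate smoothing parameter with the smallest GCV score."""
    n = len(y)
    best_lam, best_score = None, np.inf
    for lam in lambdas:
        H = hat_matrix(n, lam)
        resid = y - H @ y
        score = n * (resid @ resid) / (n - np.trace(H)) ** 2
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam

def trapezoid(f, t):
    """Trapezoidal rule for values f observed at time points t."""
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))

def commutation_dissimilarity(t, y_i, y_j, lambdas):
    """Smooth both subjects with subject i's parameter, then with subject j's,
    and combine the two integrated squared differences (assumed combination)."""
    lam_i = gcv_lambda(y_i, lambdas)
    lam_j = gcv_lambda(y_j, lambdas)
    total = 0.0
    for lam in (lam_i, lam_j):
        H = hat_matrix(len(t), lam)
        diff = H @ y_i - H @ y_j  # both curves smoothed with the same parameter
        total += trapezoid(diff ** 2, t)
    return np.sqrt(total)

# Toy data: two noisy sine curves and one noisy cosine curve.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
lambdas = 10.0 ** np.arange(-2, 5)  # candidate smoothing parameters
a = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)
b = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)
c = np.cos(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)

d_ab = commutation_dissimilarity(t, a, b, lambdas)
d_ac = commutation_dissimilarity(t, a, c, lambdas)
print(d_ab, d_ac)
```

Evaluating this quantity for every pair of subjects gives a dissimilarity matrix that could be passed to any conventional dissimilarity-based clustering routine. Unlike a true smoothing spline, the discrete smoother used here requires a common regular grid, so the sketch does not demonstrate the handling of missing values or irregular time points.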
