Model-based clustering of probability density functions

Complex data in which each statistical unit is described not by a single observation (or vector of variables) but by a unit-specific sample of several, or even many, observations are becoming increasingly common. Reducing such samples to summary statistics, such as the mean or the median, discards most of the information they carry about variability, skewness, or multimodality. Full information is preserved only if each unit is described by a whole distribution. This new kind of data, also known as “distribution-valued data”, requires the development of suitable statistical methods. This paper presents a method for grouping a set of probability density functions (pdfs) into homogeneous clusters when the pdfs must be estimated nonparametrically from the unit-specific data. Since elements belonging to the same cluster are naturally regarded as samples from the same probability model, the idea is to tackle the clustering problem by defining and estimating a suitable mixture model on the space of pdfs. Model building is challenging here because the domain space is infinite-dimensional and has a non-Euclidean geometry. By adopting a wavelet-based representation of the elements of this space, the task is accomplished using mixture models for hyperspherical data. The proposed solution is illustrated through a simulation experiment and on two real data sets.
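To make the geometry concrete, the following is a minimal sketch under simplifying assumptions, not the estimator or the mixture model of the paper. It assumes every unit's sample lies in [0, 1], replaces the wavelet density estimator with a histogram (equivalent to a Haar scaling-function expansion at one fixed resolution level), and replaces the mixture model for hyperspherical data with plain spherical k-means. The function names `sqrt_density_coefficients` and `spherical_kmeans` are hypothetical. The point it illustrates is that the coefficients of the square root of a density in an orthonormal basis have unit Euclidean norm, so each unit becomes a point on a hypersphere.

```python
import numpy as np


def sqrt_density_coefficients(sample, level=4, support=(0.0, 1.0)):
    """Haar scaling coefficients of the square root of a histogram density.

    With 2**level equal-width bins, the coefficient for bin k reduces to
    sqrt(p_k), where p_k is the proportion of observations in that bin.
    The squared coefficients sum to one, so the vector lies on the unit sphere.
    Assumes at least one observation falls inside `support`.
    """
    n_bins = 2 ** level
    counts, _ = np.histogram(sample, bins=n_bins, range=support)
    proportions = counts / counts.sum()
    return np.sqrt(proportions)


def spherical_kmeans(points, n_clusters, n_iter=50):
    """Cluster unit vectors by cosine similarity (a stand-in for a mixture model)."""
    # Deterministic farthest-point initialisation: start from the first point,
    # then repeatedly add the point least similar to the centers chosen so far.
    chosen = [0]
    while len(chosen) < n_clusters:
        similarity = points @ points[chosen].T
        chosen.append(int(np.argmin(similarity.max(axis=1))))
    centers = points[chosen].copy()

    for _ in range(n_iter):
        labels = np.argmax(points @ centers.T, axis=1)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:
                # Project the cluster mean back onto the unit sphere.
                total = members.sum(axis=0)
                centers[k] = total / np.linalg.norm(total)
    labels = np.argmax(points @ centers.T, axis=1)
    return labels, centers


# Simulated distribution-valued data: 20 units, each observed through its own
# sample of 200 values; the first 10 units share one model, the last 10 another.
rng = np.random.default_rng(0)
samples = [rng.beta(2, 8, size=200) for _ in range(10)]
samples += [rng.beta(8, 2, size=200) for _ in range(10)]

coefficients = np.array([sqrt_density_coefficients(s) for s in samples])
labels, centers = spherical_kmeans(coefficients, n_clusters=2)
print(labels)
```

Each row of `coefficients` has unit norm, which is why methods for directional data apply. A model-based approach as described in the abstract would fit a mixture of distributions on the sphere in place of the hard assignments used here.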
