Wavelet‐Based Clustering for Mixed‐Effects Functional Models in High Dimension

We propose a method for high-dimensional curve clustering in the presence of interindividual variability. Curve clustering has long been studied, especially using splines to account for functional random effects. However, splines are not appropriate for high-dimensional data and cannot be used to model irregular curves such as peak-like data. Our method is based on a wavelet decomposition of the signal for both fixed and random effects. We propose an efficient dimension reduction step based on wavelet thresholding adapted to multiple curves. Using an appropriate structure for the random effect variance, we ensure that both fixed and random effects lie in the same functional space, even when dealing with irregular functions that belong to Besov spaces. In the wavelet domain, our model reduces to a linear mixed-effects model that can be used for model-based clustering, and for which we develop an EM algorithm for maximum likelihood estimation. The properties of the overall procedure are validated by an extensive simulation study. We then illustrate our method on mass spectrometry data and propose an original application of functional data analysis to microarray comparative genomic hybridization (CGH) data. Our procedure is available through the R package curvclust (available on CRAN), which is the first publicly available package that performs curve clustering with random effects in the high-dimensional framework.
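To make the dimension reduction step concrete, the following is a minimal sketch in Python, not the curvclust implementation. It assumes a Haar wavelet, a known noise level `sigma`, and a simple rule that keeps a coefficient position when at least one curve exceeds the universal threshold; the function names `haar_dwt` and `keep_mask`, the simulated peak-like curves, and the thresholding rule itself are illustrative assumptions rather than the procedure described above.

```python
import numpy as np

def haar_dwt(x):
    """Orthonormal Haar transform of a signal whose length is a power of two."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    approx = x
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    # Coarsest scaling coefficient first, then details from coarse to fine.
    return np.concatenate([approx] + details[::-1])

def keep_mask(coeffs, sigma):
    """Positions where at least one curve exceeds the universal threshold.

    Assumption: this 'any curve' rule is a stand-in for a thresholding
    rule adapted to multiple curves; it is not taken from the paper.
    """
    n_points = coeffs.shape[1]
    thresh = sigma * np.sqrt(2.0 * np.log(n_points))
    mask = (np.abs(coeffs) > thresh).any(axis=0)
    mask[0] = True  # always keep the scaling coefficient
    return mask

# Three noisy peak-like curves sharing one peak with different amplitudes.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 64, endpoint=False)
peak = np.exp(-((t - 0.3) / 0.02) ** 2)
curves = np.stack(
    [a * peak + 0.05 * rng.standard_normal(t.size) for a in (1.0, 1.2, 0.8)]
)

coeffs = np.apply_along_axis(haar_dwt, 1, curves)  # one row per curve
mask = keep_mask(coeffs, sigma=0.05)
reduced = coeffs[:, mask]  # same retained positions for every curve
```

Because the same positions are retained for every curve, the reduced coefficient matrix can be passed to a mixed-effects or mixture model with a common design across individuals.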
