DICOD: Distributed Convolutional Coordinate Descent for Convolutional Sparse Coding

In this paper, we introduce DICOD, a convolutional sparse coding algorithm which builds shift-invariant representations for long signals. This algorithm is designed to run in a distributed setting, with local message passing, making it communication efficient. It is based on coordinate descent and uses locally greedy updates, which solve the problem faster than greedy coordinate selection. We prove the convergence of this algorithm and highlight its computational speed-up, which is super-linear in the number of cores used. We also provide empirical evidence for the acceleration properties of our algorithm compared to state-of-the-art methods.

CMLA, ENS Cachan, CNRS, Université Paris-Saclay, 94235 Cachan, France; L2TI, Université Paris 13, 93430 Villetaneuse, France. Correspondence to: Moreau Thomas <thomas.moreau@cmla.ens-cachan.fr>. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Convolutional Representation for Long Signals

Sparse coding aims at building sparse linear representations of a data set based on a dictionary of basic elements called atoms. It has proven to be useful in many applications, ranging from EEG analysis to image and audio processing (Adler et al., 2013; Kavukcuoglu et al., 2010; Mairal et al., 2010; Grosse et al., 2007). Convolutional sparse coding is a specialization of this approach, focused on building sparse, shift-invariant representations of signals. Such representations are of major interest for applications like segmentation or classification, as they separate the shape of a pattern from its location in the signal. This is typically the case for physiological signals, which can be composed of recurrent patterns linked to specific behavior in the human body, such as the characteristic heartbeat pattern in ECG recordings. Depending on the context, the dictionary can either be fixed analytically (e.g. wavelets, see Mallat 2008), or learned from the data (Bristow et al., 2013; Mairal et al., 2010).

Several algorithms have been proposed to solve the convolutional sparse coding problem. The Fast Iterative Soft-Thresholding Algorithm (FISTA) was adapted for convolutional problems in Chalasani et al. (2013) and uses proximal gradient descent to compute the representation. The Feature Sign Search (FSS), introduced in Grosse et al. (2007), solves at each step a quadratic subproblem for an active set of the estimated nonzero coordinates, and the Fast Convolutional Sparse Coding (FCSC) of Bristow et al. (2013) is based on the Alternating Direction Method of Multipliers (ADMM). Finally, coordinate descent (CD) has been extended by Kavukcuoglu et al. (2010) to solve the convolutional sparse coding problem. This method greedily optimizes one coordinate at each iteration using fast local updates. We refer the reader to Wohlberg (2016) for a detailed presentation of these algorithms.

To our knowledge, there is no scalable version of these algorithms for long signals. Long signals are nevertheless common, for instance in physiological signal processing, where sensor information can be collected for a few hours with sampling frequencies ranging from 100 to 1000 Hz. The existing algorithms for generic ℓ1-regularized optimization can be accelerated by improving the computational complexity of each iteration. A first approach to improve the complexity of these algorithms is to estimate the non-zero coordinates of the optimal solution in order to reduce the dimension of the optimization space, using either screening (El Ghaoui et al., 2012; Fercoq et al., 2015) or active-set algorithms (Johnson & Guestrin, 2015). Another possibility is to develop parallel algorithms which compute multiple updates simultaneously. Recent studies have considered distributing coordinate descent algorithms for general ℓ1-regularized minimization (Scherrer et al., 2012a;b; Bradley et al., 2011; Yu et al., 2012).
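
As a point of reference for the sequential coordinate descent discussed above, the following is a minimal sketch of greedy coordinate descent for a generic (non-convolutional) ℓ1-regularized least-squares problem, min_z 0.5 ‖x − Dz‖² + λ‖z‖₁. It is not the DICOD algorithm, nor the implementation of Kavukcuoglu et al. (2010); the function names, the dense list-of-rows dictionary `D`, and the stopping tolerance `tol` are illustrative assumptions.

```python
def soft_threshold(u, lam):
    """Soft-thresholding operator: sign(u) * max(|u| - lam, 0)."""
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0


def greedy_coordinate_descent(D, x, lam, max_iter=1000, tol=1e-10):
    """Minimize 0.5 * ||x - D z||^2 + lam * ||z||_1 over z.

    D is a list of n rows with K entries each (hypothetical dense layout),
    x is a list of length n. Returns the coefficient list z of length K.
    """
    n, K = len(D), len(D[0])
    z = [0.0] * K
    residual = list(x)  # equals x - D z, since z starts at zero
    col_sq_norms = [sum(D[i][k] ** 2 for i in range(n)) for k in range(K)]

    for _ in range(max_iter):
        best_k, best_delta = -1, 0.0
        for k in range(K):
            if col_sq_norms[k] == 0.0:
                continue  # an all-zero column cannot reduce the cost
            # Correlation of column k with the residual computed
            # as if coordinate k were currently set to zero.
            beta = sum(D[i][k] * residual[i] for i in range(n))
            beta += col_sq_norms[k] * z[k]
            # Closed-form minimizer of the cost along coordinate k.
            z_new = soft_threshold(beta, lam) / col_sq_norms[k]
            delta = z_new - z[k]
            # Greedy rule: keep the coordinate with the largest change.
            if abs(delta) > abs(best_delta):
                best_k, best_delta = k, delta
        if best_k < 0 or abs(best_delta) < tol:
            break  # no coordinate would move by more than tol
        z[best_k] += best_delta
        for i in range(n):
            residual[i] -= D[i][best_k] * best_delta
    return z
```

With an identity dictionary, the problem decouples and the result reduces to soft-thresholding each entry of `x`: calling `greedy_coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [3.0, -0.5], 1.0)` returns `[2.0, 0.0]`. Each iteration of this sketch scans all K coordinates before updating one, which is the per-iteration cost that screening, active-set, and parallel strategies aim to reduce.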
These parallel coordinate descent methods are synchronous, relying on either locks or synchronization steps to ensure convergence in general cases. You et al. (2016) derive an asynchronous distributed algorithm for projected coordinate descent, which uses centralized communication and a finely tuned step size to ensure convergence.

In the present paper, we design a novel distributed algorithm tailored for the convolutional problem which is based on coordinate descent, named Distributed Convolutional Coordinate Descent (DICOD). DICOD is asynchronous and each process can run independently, without locks or synchronization steps. This algorithm uses a local communication scheme to reduce the number of messages between the processes and does not rely on external learning rates. We also prove that this algorithm scales super-linearly with the number of cores compared to the sequential CD, up to certain limitations.

In Section 2, we introduce the DICOD algorithm for solving the convolutional sparse coding problem. Then, we prove in Section 3 that DICOD converges to the optimal solution for a wide range of settings and we analyze its complexity. Finally, Section 4 presents numerical experiments that illustrate the benefits of the DICOD algorithm with respect to other state-of-the-art algorithms and validate our theoretical analysis.

2. Distributed Convolutional Coordinate Descent (DICOD)

Notations. The space of multivariate signals of length $T$ in $\mathbb{R}^P$ is denoted by $\mathcal{X}_T^P$. For these signals, their value at time $t \in \llbracket 0, T-1 \rrbracket$ is denoted by $X[t] \in \mathbb{R}^P$, and for all $t \notin \llbracket 0, T-1 \rrbracket$, $X[t] = 0_P$. The indicator function of $t_0$ is denoted $\mathbb{1}_{t_0}$. For any signal $X \in \mathcal{X}_T^P$, the reversed signal is defined as $\tilde{X}[t] = X[T-t]$ and the $d$-norm is defined as $\|X\|_d = \left(\sum_{t=0}^{T-1} \|X[t]\|_d^d\right)^{1/d}$. Finally, for $L, W \in \mathbb{N}^*$, the convolution between $Z \in \mathcal{X}_L^1$ and $D \in \mathcal{X}_W^P$ is a multivariate signal $Z * D \in \mathcal{X}_T^P$ with $T = L + W - 1$ such that for $t \in \llbracket 0, T-1 \rrbracket$,

[1] Laurent El Ghaoui, et al. Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems, 2010, arXiv:1009.4219.

[2] José Carlos Príncipe, et al. A fast proximal method for convolutional sparse coding, 2013, The 2013 International Joint Conference on Neural Networks (IJCNN).

[3] Guillermo Sapiro, et al. Online Learning for Matrix Factorization and Sparse Coding, 2009, J. Mach. Learn. Res.

[4] Mark W. Schmidt, et al. Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection, 2015, ICML.

[5] Ambuj Tewari, et al. Stochastic methods for l1 regularized loss minimization, 2009, ICML '09.

[6] Ambuj Tewari, et al. Feature Clustering for Accelerating Parallel Coordinate Descent, 2012, NIPS.

[7] Ambuj Tewari, et al. Scaling Up Coordinate Descent Algorithms for Large ℓ1 Regularization Problems, 2012, ICML.

[8] Brendt Wohlberg, et al. Efficient Algorithms for Convolutional Sparse Representations, 2016, IEEE Transactions on Image Processing.

[9] Anders P. Eriksson, et al. Fast Convolutional Sparse Coding, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[10] Thomas Moreau. Convolutional Sparse Representations - application to physiological signals and interpretability for Deep Learning. (Représentations Convolutives Parcimonieuses - application aux signaux physiologiques et interprétabilité de l'apprentissage profond), 2017.

[11] Michael Elad, et al. Sparse Coding with Anomaly Detection, 2013, Journal of Signal Processing Systems.

[12] Joseph K. Bradley, et al. Parallel Coordinate Descent for L1-Regularized Loss Minimization, 2011, ICML.

[13] Yurii Nesterov, et al. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems, 2012, SIAM J. Optim.

[14] S. Osher, et al. Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm, 2009.

[15] R. Tibshirani, et al. Pathwise Coordinate Optimization, 2007, arXiv:0708.1485.

[16] Tyler B. Johnson, et al. Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization, 2015, ICML.

[17] Y-Lan Boureau, et al. Learning Convolutional Feature Hierarchies for Visual Recognition, 2010, NIPS.

[18] Michael Elad, et al. Working Locally Thinking Globally: Theoretical Guarantees for Convolutional Sparse Coding, 2017, IEEE Transactions on Signal Processing.

[19] James Demmel, et al. Asynchronous Parallel Greedy Coordinate Descent, 2016, NIPS.

[20] Inderjit S. Dhillon, et al. Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems, 2012, 2012 IEEE 12th International Conference on Data Mining.

[21] S. Mallat. A wavelet tour of signal processing, 1998.

[22] Alexandre Gramfort, et al. Mind the duality gap: safer rules for the Lasso, 2015, ICML.