Dynamic visualization for L1 fusion convex clustering in near-linear time

Convex clustering has drawn recent attention because of its competitive performance and its guarantee of global optimality. However, convex clustering is computationally infeasible for large-scale data sets. We propose a novel method that solves the L1 fusion convex clustering problem by dynamic programming. We develop the Convex clustering Path Algorithm In Near-linear Time (C-PAINT) to construct the solution path efficiently. C-PAINT yields the exact solution, whereas general-purpose convex solvers applied to convex clustering depend on tuning parameters such as step size and convergence threshold and usually take many iterations to converge. Apart from a sorting step, which takes almost no time in practice, the main part of the algorithm runs in linear time. Thus, C-PAINT scales better than other state-of-the-art algorithms. Moreover, C-PAINT enables path visualization of clustering solutions for large data. In particular, experiments show that the proposed method can solve convex clustering with 10^7 data points in R^2 in two minutes. We demonstrate the proposed method on both synthetic and real data. Our algorithms are implemented in the dpcc R package.
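As an illustration of why a sorting step is natural for the L1 fusion penalty, the sketch below evaluates the pairwise term, the sum over i < j of ||u_i - u_j||_1, in two ways: directly over all pairs, and coordinate by coordinate after sorting. It assumes the standard unweighted form of the penalty, which the abstract does not spell out, and it is not the C-PAINT algorithm or the dpcc API. The function names are hypothetical. It only shows that the L1 penalty separates across coordinates and becomes a linear function of the sorted values.

```python
import numpy as np

def l1_fusion_penalty_naive(u):
    """Sum over pairs i < j of ||u_i - u_j||_1, computed directly (O(n^2 d))."""
    u = np.asarray(u, dtype=float)
    n = u.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.abs(u[i] - u[j]).sum()
    return total

def l1_fusion_penalty_sorted(u):
    """Same quantity via sorting each coordinate (O(d n log n)).

    For ascending values s_1 <= ... <= s_n in one coordinate,
    sum_{i<j} |s_i - s_j| = sum_k (2k - n - 1) * s_k.
    """
    u = np.asarray(u, dtype=float)
    n = u.shape[0]
    s = np.sort(u, axis=0)                      # sort every coordinate independently
    coef = 2.0 * np.arange(1, n + 1) - n - 1    # weights (2k - n - 1), k = 1..n
    return float((coef[:, None] * s).sum())

def objective(x, u, lam):
    """Assumed objective: 0.5 * ||x - u||_F^2 + lam * L1 fusion penalty."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    return 0.5 * float(((x - u) ** 2).sum()) + lam * l1_fusion_penalty_sorted(u)

# Small check on three points in R^2.
u_demo = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
naive_value = l1_fusion_penalty_naive(u_demo)
sorted_value = l1_fusion_penalty_sorted(u_demo)
```

Both functions return the same value for any input, but the sorted version avoids enumerating all pairs, which is what makes large n tractable in this sketch.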
