Convex Clustering: An Attractive Alternative to Hierarchical Clustering

The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering also operates on multiple scales simultaneously. This is essential, for instance, with transcriptome data, where one may want to make qualitative inferences about how lower-order relationships such as gene modules lead to higher-order relationships such as pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER, which incorporates the algorithm, is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/
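
The abstract refers to the objective function of convex clustering without writing it out. As a rough illustration, the sketch below evaluates the form commonly used in the convex clustering literature: a least-squares fit of each point x_i to its own center u_i, plus a weighted penalty gamma * w_ij * ||u_i - u_j|| that pulls centers together. The symbols, the function name convex_clustering_objective, and the choice of the Euclidean norm are assumptions made for illustration; this is not the CONVEXCLUSTER interface, and it omits the missing-data handling the paper describes.

```python
import numpy as np


def convex_clustering_objective(X, U, weights, gamma):
    """Evaluate 0.5 * sum_i ||x_i - u_i||^2 + gamma * sum_{i<j} w_ij * ||u_i - u_j||.

    X, U    : (n, p) arrays holding the data points and their cluster centers.
    weights : (n, n) symmetric nonnegative array; only entries with i < j are used.
    gamma   : nonnegative penalty strength (hypothetical name for the tuning constant).
    """
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    W = np.asarray(weights, dtype=float)

    # Fidelity term: how far each center has moved from its data point.
    fit = 0.5 * np.sum((X - U) ** 2)

    # Fusion penalty: weighted Euclidean distances between all pairs of centers.
    i, j = np.triu_indices(X.shape[0], k=1)
    penalty = np.sum(W[i, j] * np.linalg.norm(U[i] - U[j], axis=1))

    return fit + gamma * penalty


if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]])
    W = np.ones((3, 3))

    # Every point is its own cluster: zero fit error, full penalty.
    print(convex_clustering_objective(X, X, W, gamma=1.0))

    # All centers fused at the grand mean: zero penalty, nonzero fit error.
    U_fused = np.tile(X.mean(axis=0), (3, 1))
    print(convex_clustering_objective(X, U_fused, W, gamma=1.0))
```

With gamma = 0 the minimizer leaves every point in its own cluster, and as gamma grows the centers merge until all coincide at the grand mean. Tracing the minimizer over a range of gamma values gives the solution path mentioned in the abstract.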
