Interactive visual exploration and refinement of cluster assignments

Background: With ever-increasing amounts of data produced in biology research, scientists are in need of efficient data analysis methods. Cluster analysis, combined with visualization of the results, is one such method that can be used to make sense of large data volumes. At the same time, cluster analysis is known to be imperfect and depends on the choice of algorithms, parameters, and distance measures. Most clustering algorithms don’t properly account for ambiguity in the source data, as records are often assigned to discrete clusters, even if an assignment is unclear. While there are metrics and visualization techniques that allow analysts to compare clusterings or to judge cluster quality, there is no comprehensive method that allows analysts to evaluate, compare, and refine cluster assignments based on the source data, derived scores, and contextual data.

Results: In this paper, we introduce a method that explicitly visualizes the quality of cluster assignments, allows comparisons of clustering results and enables analysts to manually curate and refine cluster assignments. Our methods are applicable to matrix data clustered with partitional, hierarchical, and fuzzy clustering algorithms. Furthermore, we enable analysts to explore clustering results in context of other data, for example, to observe whether a clustering of genomic data results in a meaningful differentiation in phenotypes.

Conclusions: Our methods are integrated into Caleydo StratomeX, a popular, web-based, disease subtype analysis tool. We show in a usage scenario that our approach can reveal ambiguities in cluster assignments and produce improved clusterings that better differentiate genotypes and phenotypes.
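To make the idea of a per-record "quality of cluster assignment" concrete, the following is a minimal sketch that computes a silhouette value for each record. The silhouette is one common measure of this kind; the abstract does not state which scores the tool actually uses, so the choice of measure, the function name `silhouette_scores`, and the toy data are all assumptions made for illustration.

```python
import math


def silhouette_scores(points, labels):
    """Return one silhouette value per record (illustrative helper, not from the paper).

    Values near 1 suggest a clear assignment, values near 0 an ambiguous one,
    and negative values a record that sits closer to another cluster.
    """
    # Group record indices by their cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            # Singleton cluster or only one cluster: score is taken as 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other records in the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the records of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


# Toy data (assumed): two tight groups plus one record halfway between them.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 0.5)]
labels = [0, 0, 1, 1, 0]
for point, label, score in zip(points, labels, silhouette_scores(points, labels)):
    print(point, label, round(score, 3))
```

In this toy example the record halfway between the two groups receives a score close to zero, which is the kind of unclear assignment an analyst might want to inspect and reassign manually.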
