Dimensionality Reduction in the Wild: Gaps and Guidance

Despite an abundance of technical literature on dimension reduction (DR), our understanding of how real data analysts use DR techniques and what problems they face remains largely incomplete. In this paper, we contribute the first systematic and broad analysis of DR usage by a sample of real data analysts, along with their needs and problems. We present the results of a two-year qualitative research endeavor in which we iteratively collected and analyzed a rich corpus of data in the spirit of grounded theory. We interviewed 24 data analysts from different domains and surveyed papers depicting applications of DR. The result is a descriptive taxonomy of DR usage, together with concrete real-world usage examples summarized in terms of this taxonomy. We also identify seven gaps where users' DR needs are unfulfilled by currently available techniques, and three mismatches where users do not need the techniques on offer. At the heart of our taxonomy is a task classification that differentiates between abstract tasks related to point clusters and those related to dimensions. The taxonomy and usage examples are intended to provide a better descriptive understanding of real data analysts’ practices and needs with regard to DR. The gaps are intended as prescriptive pointers to future research directions. The most important gaps are a lack of support for users without expertise in the mathematics of DR, and an absence of DR techniques for comparing explicit groups of dimensions or for relating non-linear embeddings to the original dimensions.
