Scaled radial axes for interactive visual feature selection: A case study for analyzing chronic conditions

Abstract In statistics, machine learning, and related fields, feature selection is the process of choosing a smaller subset of features to work with. It matters because working with fewer features helps analysts interpret models and data and reduces computational runtimes. While many techniques are purely automatic, the data visualization community has produced a number of interactive approaches in which users make decisions that take their domain knowledge into account. In this paper we propose a new visualization technique based on radial axes that, in contrast to previous radial axes methods, allows analysts to perform feature selection effectively. It does so by employing alternative scaled axes that reveal which features contribute less to the visualizations. Analysts can therefore use the technique to carry out interactive backward feature elimination, discarding the least relevant features according to the information in the plots and their own expertise. Our approach can be coupled with any linear dimensionality reduction method and can be used when analyzing cluster structure, correlations, class separability, and similar properties. In this paper we focus on combining the proposed technique with methods designed for classification. Lastly, we illustrate the effectiveness of our proposal through a case study analyzing high-dimensional medical data on chronic conditions. In particular, clinicians have used the technique to determine the most important features that discriminate between patients with diabetes and high blood pressure.
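
The backward elimination loop described in the abstract can be illustrated with a minimal sketch. It does not reproduce the scaled axes proposed in the paper. As a stand-in assumption, it uses PCA as the linear dimensionality reduction method, treats each row of the projection matrix as a feature's radial axis vector, and uses the length of that vector as a rough proxy for the feature's contribution to the plot. The function names `axis_vectors` and `backward_elimination` are hypothetical, and the automatic choice of the shortest axis replaces the decision an analyst would make interactively.

```python
import numpy as np


def axis_vectors(X, n_components=2):
    """Return one axis vector per feature for a PCA-based radial axes plot."""
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    Z = Xc / std
    # Rows of vt are the principal directions of the standardized data.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return vt[:n_components].T  # shape: (n_features, n_components)


def backward_elimination(X, names, keep):
    """Repeatedly drop the feature whose axis vector is shortest.

    Assumption: axis length approximates a feature's contribution to the
    plot; in the paper this judgment is made by the analyst using the
    scaled axes and domain knowledge.
    """
    names = list(names)
    cols = list(range(X.shape[1]))
    removed = []
    while len(cols) > keep:
        V = axis_vectors(X[:, cols])          # recompute the projection
        lengths = np.linalg.norm(V, axis=1)   # one length per remaining feature
        worst = int(np.argmin(lengths))
        removed.append(names[cols[worst]])
        del cols[worst]
    return [names[c] for c in cols], removed
```

Calling `backward_elimination(X, names, keep=k)` on a samples-by-features array returns the names of the retained features and the order in which the others were discarded. The projection is recomputed after each removal because discarding a feature changes the remaining axis vectors.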
