A Partition-Based Framework for Building and Validating Regression Models

Regression models play a key role in many application domains for analyzing or predicting a quantitative dependent variable based on one or more independent variables. Automated approaches for building regression models are typically limited with respect to incorporating domain knowledge in the process of selecting input variables (also known as feature subset selection). Other limitations include the identification of local structures, transformations, and interactions between variables. The contribution of this paper is a framework for building regression models addressing these limitations. The framework combines a qualitative analysis of relationship structures by visualization and a quantification of relevance for ranking any number of features and pairs of features which may be categorical or continuous. A central aspect is the local approximation of the conditional target distribution by partitioning 1D and 2D feature domains into disjoint regions. This enables a visual investigation of local patterns and largely avoids structural assumptions for the quantitative ranking. We describe how the framework supports different tasks in model building (e.g., validation and comparison), and we present an interactive workflow for feature subset selection. A real-world case study illustrates the step-wise identification of a five-dimensional model for natural gas consumption. We also report feedback from domain experts after two months of deployment in the energy sector, indicating a significant effort reduction for building and improving regression models.
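The partition-based relevance idea can be illustrated with a minimal sketch. It is not the paper's actual ranking measure. It assumes equal-frequency bins and scores a feature (or a pair of features) by the fraction of target variance explained when the target is approximated by its mean within each region. The function names `rank_bins`, `explained_variance`, `relevance_1d`, and `relevance_2d` are hypothetical and introduced only for illustration.

```python
import random
from collections import defaultdict
from statistics import fmean


def rank_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank.

    Assumption: ties are broken by sort order, so equal values may land
    in neighbouring bins. That is acceptable for continuous features.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // n
    return bins


def explained_variance(region_ids, y):
    """Fraction of the variance of y explained by per-region means."""
    grand_mean = fmean(y)
    total_ss = sum((v - grand_mean) ** 2 for v in y)
    if total_ss == 0:
        return 0.0
    regions = defaultdict(list)
    for region, v in zip(region_ids, y):
        regions[region].append(v)
    within_ss = 0.0
    for members in regions.values():
        region_mean = fmean(members)
        within_ss += sum((v - region_mean) ** 2 for v in members)
    return 1.0 - within_ss / total_ss


def relevance_1d(x, y, n_bins=10):
    """Relevance of a single feature via a 1D partition of its domain."""
    return explained_variance(rank_bins(x, n_bins), y)


def relevance_2d(x1, x2, y, n_bins=5):
    """Relevance of a feature pair via a 2D grid of disjoint cells."""
    cells = list(zip(rank_bins(x1, n_bins), rank_bins(x2, n_bins)))
    return explained_variance(cells, y)


if __name__ == "__main__":
    # Synthetic data: the target depends only on the interaction a * b.
    rng = random.Random(0)
    a = [rng.uniform(-1, 1) for _ in range(2000)]
    b = [rng.uniform(-1, 1) for _ in range(2000)]
    y = [ai * bi + rng.gauss(0, 0.05) for ai, bi in zip(a, b)]
    print("a alone:", round(relevance_1d(a, y), 3))
    print("b alone:", round(relevance_1d(b, y), 3))
    print("pair (a, b):", round(relevance_2d(a, b, y), 3))
```

In this synthetic example neither feature is informative on its own, while the pair is, which is the kind of interaction a 2D partition can expose without assuming a linear or other parametric form. The raw score is optimistically biased when regions contain few points, so a practical ranking would need some correction or a minimum region size.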
