Significance of Patterns in Data Visualisations

In this paper, we consider the following problem: when we explore data visually and observe patterns, how can we determine their statistical significance? Patterns observed in exploratory analysis are traditionally met with scepticism, since the hypotheses are formulated while viewing the data rather than beforehand. Contrary to this view, we show that the significance of patterns can also be evaluated during exploratory analysis, and that the analyst's knowledge can be leveraged to improve statistical power by reducing the number of simultaneous comparisons. We develop a principled framework for determining the statistical significance of visually observed patterns. Furthermore, we show how to determine the significance of visual patterns observed during iterative data exploration. We perform an empirical investigation on real and synthetic tabular data and time series, using different test statistics and methods for generating surrogate data. We conclude that the proposed framework makes it possible to determine the significance of visual patterns during exploratory analysis.
