From Vision Science to Data Science: Applying Perception to Problems in Big Data

In the era of big data, along with machine learning and databases, visualization has become critical to managing complex and overwhelming data problems. Vision science has been a foundation of data visualization for decades. As the systems that use visualization become more complex, advances in vision science are needed to provide fundamental theory to visualization researchers and practitioners to address emerging challenges. In this paper, we present our work on modeling the perception of correlation in bivariate visualizations using Weber’s Law. These Weber models can be applied to definitively compare and evaluate the effectiveness of these visualizations. We further demonstrate that the reason for this finding is that people approximate correlation using visual features that are known to follow Weber’s Law. These findings have multiple implications. One practical implication is that results like these can guide practitioners in choosing the appropriate visualization. In the context of big data, this result can lead to perceptually driven computational techniques. For instance, it could be used to quickly sample from big data in a way that preserves important data features, which can lead to better computational performance, a less overwhelming user experience, and more fluid interaction.

Introduction

The rise of data science, spurred by the growth in data size and complexity, has led to new advances in the fields of databases, machine learning, and visualization. These three pillars enable data stakeholders to store, analyze, and make sense of big data. Of the three areas, visualization represents the last step of the pipeline, where automated computation meets the human user. Recent advances in visualization techniques have led to innovative systems that allow the user to interactively and visually explore large amounts of data.
Success stories such as Tableau [1, 22], SpotFire [4], and SAS Visual Analytics [2] demonstrate the importance of integrating visualization with machine learning and databases to solve big data problems. However, as data size and complexity continue to rise, it has become increasingly clear that the visualization component is both a critical element and the bottleneck in the analysis pipeline. Both the database and machine learning components can scale to meet increased data complexity by adding more storage and more compute nodes in a server farm. The visualization component, on the other hand, is constrained by both the display technology and the human user’s perceptual and cognitive limitations.

In this paper, we examine the constraints of the visualization component in the context of big data analytics. While these constraints can be considered limitations of the data analysis pipeline, we propose that they also represent opportunities to develop a new user-centric paradigm that makes use of vision science to design not only new visualization techniques, but also database and machine learning algorithms. The resulting system represents a new approach to big data analytics that puts the human user’s needs and limitations first, thereby creating a system that is faster, more fluid, and more intuitive to the user.

Background

Figure 1 shows a traditional (non-interactive) process of data visualization (adapted from the data state reference model by Chi [7]). First, the data is retrieved from the database into the visualization system. The system then maps elements of the data to different perceptual elements (such as color, size, and shape) [5]. Lastly, the human user perceives the image and identifies patterns and trends that might lead to new insights about the data. Although simplistic, this pipeline serves as the foundation of all visual analytics systems today. Recent advances in this topic can be seen as improving the stages in this pipeline.
For example, nanocubes [14] and multivariate data tiles [15] are data storage techniques that use compact data structures to aggregate the underlying data hierarchically. These data summaries can be precomputed at various levels of abstraction based on the number of pixels available for the visualization and the size of the underlying raw dataset. Binned aggregation [15], [24] takes this even further by separating the raw data into bins and returning a small set of summary statistics. This technique can show both densities and outliers by varying the bin size. Any issues with variability in the summaries can be resolved with various smoothing methods [24].

Another technique is to provide approximate incremental answers. The sampleAction [9] and VisReduce [12] systems incrementally return partial answers to user queries, computed over increasingly larger samples of data. This has the benefit of providing a partial response to an exploratory query quickly: once the user has a good enough answer, they can stop the process and move on. For exact answers from raw data, systems such as Dremel [16] and MapD [17] take advantage of parallelism and large computing clusters for computational power. Although effective, their cost and proprietary query languages can hinder widespread adoption.

While these new techniques, methods, and systems have led to a faster and more efficient data visualization process, our goal in this paper is fundamentally different. Unlike these advances, which seek to improve a component of the pipeline, we propose that a new paradigm of the data visualization process can lead to advances in vision science, visualization techniques, and closer integration of machine learning, database, and visualization.

Figure 1: A simplified pipeline of data visualization. The data is first fetched from the database and delivered to a client system to render a visualization. The human user perceives information from the visualization.

©2016 Society for Imaging Science and Technology DOI: 10.2352/ISSN.2470-1173.2016.16HVEI-131 IS&T International Symposium on Electronic Imaging 2016 Human Vision and Electronic Imaging 2016 HVEI-131.1

Human Perceptual Limitations

In order to develop a paradigm that focuses on the limitations of human perceptual and cognitive abilities, we first examine some examples of low-level limitations in the data visualization process. Consider a visualization display with a resolution of 1000 x 1000 pixels, for a total of 1 million pixels, each capable of displaying three color channels. When used in a visualization, it has been shown that this 1 million pixels (the resolution of the display) is the theoretical upper bound on the amount of information that a human can perceive [6].

This theoretical upper bound is important because it suggests that the first step in the visualization pipeline shown in Figure 1 is lossy when displaying a large amount of data. For example, imagine a database that holds 10 million records of data. When the 10 million records are sent to the visualization system, they need to be “compressed” into 1 million pixels, a 10:1 reduction that loses data. The “compression” can be performed using a variety of methods. Most commonly the data is aggregated (averaged) into a single value, but other methods such as clustering and sampling are also frequently used [20].

Figure 2: A screen with a resolution of 1000 x 1000 can at most display 1 million pixels. When a visualization reaches this upper bound, however, the resulting image is often unrecognizable.

In addition, beyond the theoretical limitation of the display technology, the second step in the visualization pipeline is also lossy.
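The averaging form of the “compression” described above can be sketched in a few lines. This is a minimal illustration rather than the method of any system cited here: the function name `aggregate_to_pixels`, the (x, y, value) record layout, and the assumption that coordinates are normalized to [0, 1) are all hypothetical, and the sizes are scaled down from the 10 million records and 1 million pixels in the text so that the sketch runs quickly.

```python
import random
from collections import defaultdict


def aggregate_to_pixels(records, width, height):
    """Average the values of all records that fall into the same pixel.

    records: iterable of (x, y, value) tuples, with x and y assumed to be
             normalized to the range [0, 1) (an assumption of this sketch).
    Returns a dict mapping (col, row) -> mean value for each non-empty pixel.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, value in records:
        # Map the normalized coordinates onto the pixel grid.
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        sums[(col, row)] += value
        counts[(col, row)] += 1
    return {pixel: sums[pixel] / counts[pixel] for pixel in sums}


# Scaled-down stand-in for the example in the text:
# 10,000 synthetic records are reduced to at most 40 x 25 = 1,000 pixel values.
rng = random.Random(0)
records = [(rng.random(), rng.random(), rng.random()) for _ in range(10_000)]
pixels = aggregate_to_pixels(records, width=40, height=25)
print(len(records), "records ->", len(pixels), "pixel values")
```

Whatever the size of the input, the output never contains more entries than there are pixels, which is the sense in which the first step of the pipeline is lossy. Swapping the mean for a count, a maximum, or a sampled representative would give the other reduction strategies mentioned in the text.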
While the display resolution constrains what the user is able to perceive, comprehending the visualization is further constrained by the user’s cognitive limitations. Continuing the previous example, when each pixel represents 10 data elements in aggregate, the visualization can resemble colorful “snow” (see Figure 2). Although the data-visual mapping of this visualization may be coherent and accurate, and may maximize information content, the cognitive limitations of the user make this visualization less than useful [6].

Applying the Perceptual Limitations

Based on the user’s perceptual and cognitive limitations, we propose that there are two immediate opportunities for optimizing the design of a visualization system.

Pixel-Based Constraint

First, for a display system that can render at most 1 million “pieces” of data, it does not make sense for a database to transfer more than 1 million rows to the visualization system with that display. Since transferring data from the database to the visualization can be costly (especially when the two are connected via a network), minimizing the amount of data transferred from the database will improve the performance of the overall system.

Note that the 1 million rows of data transferred from the database can be raw or processed data. Using sampling techniques [3], the database can choose the most representative 1 million raw data elements. Alternatively, using aggregation or clustering techniques, each of the 1 million rows can represent the mean of a large number of raw data elements. When combined with the notion that data transfer is costly, this implies that most of the data processing should take place in the database system. Only the resulting computed data should be sent to the visualization system for rendering.

Perceptually-Based Constraint

Second, we consider and leverage the user’s perceptual and cognitive limitations when perceiving an image.
For example, Figure 3 shows two images that appear very similar. However, the image on the right has a significantly coarser resolution than the one on the left (301 KB vs. 115 KB, a 2.62:1 compression ratio). Many existing image compression techniques (such as JPEG-2000 [23]) are based on this same idea: as long as the user cannot tell the difference, keep reducing

[1] Ion Stoica, et al. BlinkDB: queries with bounded errors and bounded response times on very large data, 2012, EuroSys '13.

[2] Carlos Eduardo Scheidegger, et al. Nanocubes for Real-Time Exploration of Spatiotemporal Datasets, 2013, IEEE Transactions on Visualization and Computer Graphics.

[3] W. Cleveland, et al. Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods, 1984.

[4] Jeffrey Heer, et al. Beyond Weber's Law: A Second Look at Ranking Visualizations of Correlation, 2016, IEEE Transactions on Visualization and Computer Graphics.

[5] Jacques Bertin, et al. Semiology of Graphics - Diagrams, Networks, Maps, 2010.

[6] Steven Franconeri, et al. Ranking Visualizations of Correlation Using Weber's Law, 2014, IEEE Transactions on Visualization and Computer Graphics.

[7] Min Chen, et al. An Information-theoretic Framework for Visualization, 2010, IEEE Transactions on Visualization and Computer Graphics.

[8] Ronald A. Rensink. On the Prospects for a Science of Visualization, 2014, Handbook of Human Centric Visualization.

[9] Andrey Gubarev, et al. Dremel: Interactive Analysis of Web-Scale Datasets, 2011.

[10] Hamid Pirahesh, et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, 1996, Data Mining and Knowledge Discovery.

[11] Michael J. McGuffin, et al. VisReduce: Fast and responsive incremental information visualization of large datasets, 2013, 2013 IEEE International Conference on Big Data.

[12] Michael W. Marcellin, et al. JPEG2000 - image compression fundamentals, standards and practice, 2002, The Kluwer International Series in Engineering and Computer Science.

[13] Ronald A. Rensink, et al. The Perception of Correlation in Scatterplots, 2010, Comput. Graph. Forum.

[14] Ben Shneiderman, et al. Extreme visualization: squeezing a billion records into a million pixels, 2008, SIGMOD Conference.

[15] Monica M. C. Schraefel, et al. Trust me, I'm partially right: incremental visualization lets analysts explore large datasets faster, 2012, CHI.

[16] Christopher Ahlberg, et al. Spotfire: an information exploration environment, 1996, SGMD.

[17] Jeffrey Heer, et al. imMens: Real‐time Visual Querying of Big Data, 2013, Comput. Graph. Forum.

[18] H. Wickham. Bin-summarise-smooth: A framework for visualising large data, 2013.

[19] Pat Hanrahan, et al. Polaris: a system for query, analysis and visualization of multi-dimensional relational databases, 2000, IEEE Symposium on Information Visualization 2000. INFOVIS 2000. Proceedings.

[20] S. S. Stevens. On the psychophysical law, 1957, Psychological Review.

[21] Ed H. Chi, et al. A taxonomy of visualization techniques using the data state reference model, 2000, IEEE Symposium on Information Visualization 2000. INFOVIS 2000. Proceedings.