Profiler: integrated statistical analysis and visualization for data quality assessment

Data quality issues such as missing, erroneous, extreme and duplicate values undermine analysis and are time-consuming to find and fix. Automated methods can help identify anomalies, but determining what constitutes an error is context-dependent and so requires human judgment. While visualization tools can facilitate this process, analysts must often construct the necessary views manually, which requires significant expertise. We present Profiler, a visual analysis tool for assessing quality issues in tabular data. Profiler applies data mining methods to automatically flag problematic data and suggests coordinated summary visualizations for assessing the data in context. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction with millions of data points. We describe Profiler's architecture, including modular components for custom data types, anomaly detection routines and summary visualizations, and its application to motion picture, natural disaster and water quality data sets.
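
To make the idea of automatically flagging problematic data concrete, the following is a minimal sketch of a per-column check for three of the issue types named above: missing, duplicate and extreme values. It is an illustration under stated assumptions, not Profiler's implementation. The function name `flag_column`, the flag labels, the use of a robust z-score based on the median absolute deviation, and the threshold of 3.5 are all hypothetical choices made for this example.

```python
from collections import Counter
from statistics import median


def flag_column(values, threshold=3.5):
    """Return {row_index: set_of_flags} for one column of a table.

    Hypothetical helper for illustration only; the detection rules
    below are assumptions, not the routines used by Profiler.
    """
    flags = {}

    # Missing values: treat None and empty strings as missing.
    present = []
    for i, v in enumerate(values):
        if v is None or v == "":
            flags.setdefault(i, set()).add("missing")
        else:
            present.append((i, v))

    # Duplicate values: any non-missing value that occurs more than once.
    # (A crude stand-in for duplicate-record detection.)
    counts = Counter(v for _, v in present)
    for i, v in present:
        if counts[v] > 1:
            flags.setdefault(i, set()).add("duplicate")

    # Extreme values: robust z-score using the median absolute deviation.
    numeric = [
        (i, v) for i, v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric:
        med = median(v for _, v in numeric)
        mad = median(abs(v - med) for _, v in numeric)
        if mad > 0:  # skip when the spread is degenerate
            for i, v in numeric:
                if 0.6745 * abs(v - med) / mad > threshold:
                    flags.setdefault(i, set()).add("extreme")

    return flags


if __name__ == "__main__":
    column = [10, 12, 11, None, 13, 12, 500]
    for row, found in sorted(flag_column(column).items()):
        print(row, sorted(found))
```

Whether a flagged row is actually an error still depends on context, which is why such flags are best treated as candidates for visual inspection rather than as automatic corrections.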
