Visualizations for High Dimensional Data Mining - Table Visualizations

Visualizations that can handle flat files, or simple table data, are the ones most often used in data mining. In this paper we survey most visualizations that can handle more than three dimensions and fit our definition of Table Visualizations. We define Table Visualizations and some additional terms needed for the Table Visualization descriptions. For a preliminary evaluation of some of these visualizations, see “Benchmark Development for the Evaluation of Visualization for Data Mining,” also included in this volume.

Data Sets Used

Most of the visualization examples use either the automobile or the Iris flower dataset. Nearly every data mining package comes with at least one of these two datasets. The datasets are available from the UC Irvine Machine Learning Repository [Uci97].

• Iris Plant Flowers – from Fisher 1936, physical measurements from three types of flowers.
• Car (Automobile) – data concerning cars manufactured in America, Japan and Europe from 1970 to 1982.

Definition of Table Visualizations

A two-dimensional table of data is defined by M rows and N columns. A visualization of this data is termed a Table Visualization. In our definition, the columns are the dimensions or variates (also called fields or attributes), and the rows are the data records. The data records are sometimes called n-dimensional points, or cases. For a more thorough discussion of the table model, see [Car99]. This very general definition rules out only some structured or hierarchical data.

In the most general case, a visualization maps certain dimensions to certain features in the visualization. In geographical, scientific, and imaging visualizations, the spatial dimensions are normally assigned to the appropriate X, Y or Z spatial dimension. In a typical information visualization there is no inherent spatial dimension, but quite often the dimensions mapped to height and width on the screen have a dominating effect.
For example, in a scatter plot of four-dimensional data one could map two dimensions to the X- and Y-axes and the other two to the color and shape of the plotted points. The dimensions assigned to the X- and Y-axes would dominate many aspects of analysis, such as clustering and outlier detection. Some Table Visualizations, such as Parallel Coordinates, Survey Plots, or Radviz, treat all of the data dimensions equally. We call these Regular Table Visualizations (RTVs).

The data in a Table Visualization is discrete. The data can be represented by different types, such as integer, real, categorical, nominal, etc. In most visualizations all data is converted to a real type before rendering the visualization. Because the rendered data are then all of the same type, these visualizations can also be called “Array Visualizations.” We are concerned with issues that arise from the various types of data, however, and use the more general term “Table Visualization.”

Table Visualization data is not hierarchical. It does not explicitly contain internal structure or links. The data has a finite size (N and M are bounded). The data can be viewed as M points having N dimensions or features. The order of the table can sometimes be considered another dimension, which is an ordered sequence of integer values from 1 to M. If the table represents points in some other sequence such as a time series, that information should be represented as another column.
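
The conversion of mixed-type table data to a real type, and the treatment of row order as an additional dimension, can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the method of any particular visualization package: the function name `to_real_table`, the `add_row_order` parameter, and the choice to code categorical values by order of first appearance are all hypothetical, and the sample records are illustrative values in the style of the Iris dataset.

```python
from typing import Dict, List, Sequence, Union

Value = Union[int, float, str]


def to_real_table(rows: Sequence[Sequence[Value]],
                  add_row_order: bool = False) -> List[List[float]]:
    """Convert an M x N table of mixed-type values into a table of floats.

    Numeric values are cast to float. Non-numeric (categorical or nominal)
    values are replaced, per column, by integer codes assigned in order of
    first appearance. That coding is an arbitrary choice for illustration.
    """
    if not rows:
        return []
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("every record must have the same number of dimensions")

    # One code book per column, mapping each category to its integer code.
    codes: List[Dict[str, int]] = [{} for _ in range(n)]
    out: List[List[float]] = []
    for i, row in enumerate(rows, start=1):
        real_row: List[float] = []
        for j, v in enumerate(row):
            if isinstance(v, (int, float)) and not isinstance(v, bool):
                real_row.append(float(v))
            else:
                code = codes[j].setdefault(str(v), len(codes[j]))
                real_row.append(float(code))
        if add_row_order:
            # Row order 1..M treated as one more dimension (column N + 1).
            real_row.append(float(i))
        out.append(real_row)
    return out


# Illustrative records: four measurements plus a categorical class label.
iris_like = [
    [5.1, 3.5, 1.4, 0.2, "setosa"],
    [7.0, 3.2, 4.7, 1.4, "versicolor"],
    [6.3, 3.3, 6.0, 2.5, "virginica"],
    [4.9, 3.0, 1.4, 0.2, "setosa"],
]
real = to_real_table(iris_like, add_row_order=True)
```

Here M = 4 and N = 5; with `add_row_order=True` each converted record gains a sixth value holding its position in the table. Note that integer coding imposes an ordering on nominal values that the data does not contain, which is one of the type-related issues a Table Visualization has to take into account.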