Visualisation for Data Mining

Modern computing power makes possible the analysis of larger and larger data sets, and many new methods have been suggested under the broad heading of Data Mining. Visualisation of data, of model-fitting and of results plays an important part, but large data sets are different, and new methods of display are needed for dealing with them. This paper reviews the standard problems in displaying large numbers of cases and variables, both continuous and categorical, and emphasises the need for improving current software. Much could be achieved by adding interactive tools like querying, linking and sorting to standard displays to provide greater flexibility and to facilitate a more exploratory approach.

1 What is Data Mining?

Large data sets are more and more common. Every organisation is able to collect and store vast quantities of information. Supermarkets have sales figures for individual items and for customers. Phone companies have details of every phone call made. Weather computers store records of all manner of meteorological data. Websites try to monitor internet usage. And so on and so on.

There is no point in maintaining data sets unless some attempt is made to get information out of them. Statisticians have always analysed large data sets, but what is meant by large has changed over the years with the increasing power of computers. Analyses which took months by hand fifty years ago can now be carried out in a second. Much larger data sets can be considered, and new problems have arisen in consequence. Some standard statistical methods do not scale up well to the big data sets to be analysed nowadays. New ideas and new approaches are needed.

One term which has been heard more and more often in this connection in recent years is Data Mining. It is so new that not all are agreed on what it might mean. David Hand has suggested that any definition should include the qualification that Data Mining is usually applied to data sets which have been collected for another purpose; in other words, Data Mining analyses are secondary analyses of data. This has implications for the quality of the data and for the difficulties of interpreting and generalising any results obtained. Results should not be reported as if they were based on random samples from a population of interest.

Another unexpected characteristic of Data Mining to be borne in mind is that the "best" results are not likely to be the ones that are of most interest. The strongest results will either be known already or superficially obvious. The results which were previously unknown and do not stand out require more careful elicitation and will appear further down any list of outputs from Data Mining analyses. This suggests combining both aspects of Data Mining in the following definition: Data Mining is the secondary analysis of large data sets looking for secondary results.

Computer scientists use Data Mining to describe methods which automatically search data sets for "interesting information". Statisticians tend to use the term with a slightly negative tone to describe searching large data sets for anything of interest. Both groups have an important part to play: the computer scientists contribute fast and efficient methods for exploring the data; the statisticians contribute ways of assessing "interestingness" and the strengths of statistical principles of data analysis.
It cannot be emphasised enough that every reliable optimal search algorithm will produce an optimum, but that does not mean that the result produced is worth considering.

2 The importance of visualisation

One good way of assessing the value of results is to examine them visually, and that should be a major application of visualisation methods in Data Mining. The phrase "should be" is used advisedly, as graphics are used far less than they might be, both at this stage of analysis and also at other stages where they might play a part: in investigating data quality, in identifying patterns or in suggesting structures. There are several possible reasons for this state of affairs. Statistical graphics are not underpinned by a formal theory, but are more a collection of useful tools than a solid structure which can be built on. (Leland Wilkinson's book "The Grammar of Graphics" and Adi Wilhelm's research have only been published in the last year.) Graphics software for exploratory analyses is unsatisfactory: software tends to concentrate on presentation graphics, which are unsuited for exploratory work. The available graphics software also tends to be poor for large data sets; little effort has been made to develop graphics for the scale of problems met in modern data analysis.

This paper discusses responses to some of these arguments. Ways of scaling up classical graphics are described, and software which enables effective exploratory analyses of large data sets will be illustrated. The visual approach complements more analytic methods and should be an essential component of Data Mining studies. Note that the displays considered here are for the raw data, so to speak, and attempt to work with the full dimensionality of the data set. There are other approaches (for instance biplots or projection pursuit displays) whose aim is to find informative views in new, lower-dimensional spaces. Dimension-reduction methods are not discussed here.

Data visualisation should be central to Data Mining for another reason. Traditional statistical modelling assumes a clear goal to be achieved. Similarly, automatic search engines assume a stated optimisation criterion. Yet in Data Mining there are no specific goals, just the avowed aim to get some information — any real information — out of the data. Goals of analysis can include the identification of outliers, the definition of particular groups or clusters, and both deep analysis of smaller subgroups and sweeping generalisations about the whole data set. Visualisation is a flexible approach which encourages the consideration of several goals in parallel. It is therefore ideal for Data Mining.

As an example of the problems of displaying large data sets, consider the two scatterplots in Figure 1. Both show profit against amount for financial deals carried out by a bank. There are almost 1,000,000 points in the data set, but for explanatory purposes the left-hand plot is based on a subset of 17,243 data points, while the plot on the right shows 9,095 points from that subset close to (0,0). The right-hand plot was drawn by reducing both the y-axis and the x-axis scales by a factor of 10.

Figure 1: Profit against amount traded for financial transactions in a bank. Scales have been removed for reasons of confidentiality.

There is no ideal static solution for these data, but a combination of querying, rescaling, zooming, linking and making use of multiple plots enables the information in the variables to be found relatively easily. As is typical of large data sets, there are many different pieces of information to be found (the outliers in the left-hand plot, the variety of linear relationships near (0,0) in the right-hand plot, clusterings of points, etc.) and many different views are required. The important thing is that these views can be generated quickly and flexibly to match the wide variety of possible features that might be in the data.
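The displays in this paper come from specialised interactive software, but the rescaling and linking ideas just described can be sketched with standard tools. The following is a minimal, illustrative sketch only, not the software used for Figure 1: it uses matplotlib and synthetic heavy-tailed data as a stand-in for the confidential bank data (the names amount and profit and all parameters are assumptions). It shows two scatterplots of the same cases, the right one with both axes rescaled by a factor of 10, and links them so that a rectangle selection in either plot highlights the selected cases in both.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import RectangleSelector

# Synthetic stand-in for the confidential bank data:
# most deals cluster near (0, 0), a few are extreme.
rng = np.random.default_rng(0)
n = 17_243
amount = rng.standard_t(df=2, size=n) * 1_000
profit = 0.02 * amount * rng.standard_t(df=2, size=n)

fig, (ax_all, ax_zoom) = plt.subplots(1, 2, figsize=(10, 4))
pts_all = ax_all.scatter(amount, profit, s=2, c="grey")
pts_zoom = ax_zoom.scatter(amount, profit, s=2, c="grey")
ax_all.set_title("All 17,243 points")
ax_zoom.set_title("Axes rescaled by a factor of 10")

# Rescale: shrink both axis scales of the right-hand plot by a
# factor of 10, as described for Figure 1.
x0, x1 = ax_all.get_xlim()
y0, y1 = ax_all.get_ylim()
ax_zoom.set_xlim(x0 / 10, x1 / 10)
ax_zoom.set_ylim(y0 / 10, y1 / 10)

# Link: dragging a rectangle in either plot highlights the
# selected cases in both plots.
def onselect(press, release):
    xa, xb = sorted((press.xdata, release.xdata))
    ya, yb = sorted((press.ydata, release.ydata))
    sel = (amount >= xa) & (amount <= xb) & (profit >= ya) & (profit <= yb)
    colours = np.where(sel, "red", "grey")
    pts_all.set_color(colours)
    pts_zoom.set_color(colours)
    fig.canvas.draw_idle()

# Keep references to the selectors so they are not garbage-collected.
selectors = [RectangleSelector(ax, onselect, useblit=True)
             for ax in (ax_all, ax_zoom)]
plt.show()
```

Purpose-built interactive systems make such linked selection immediate; the point of the sketch is only the mechanism, namely one selection state shared across several views of the same cases.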
Many of the weaknesses of graphic displays in relation to large data sets can be got round with a combination of relatively minor adjustments and by making use of interaction as just suggested (though some displays, like stem-and-leaf plots, do not scale up at all). A discussion of the basic interactive tools that should be available for any plot may be found in Unwin [1999].

Although the ideas of interactive statistical graphics have been around for some time (there was a good collection of articles published in Cleveland and McGill as early as 1988), they have not yet come into common use. The reason is simple: it is hard to write good interactive graphics software, and thus little is available. This is a great pity, as it has held back progress in the area. As Swayne and Klinke remark in their introduction to the recent special issue of Computational Statistics [Vol 14, #1 1999] devoted to interactive statistics, it is surprising how little some statisticians require of a system to call it interactive.

3 Statistical graphic displays for large data sets

3.1 Displays for single continuous variables

The common graphic displays all have weaknesses in displaying large data sets. Dotplots are good for data sets with up to perhaps 100 cases, mainly for identifying individual cases and for showing any gaps in the data. With larger numbers of cases it is impossible to identify individuals, and there are rarely any gaps.

Boxplots are useful for identifying outliers, but consider what happens as a data set grows. An empirical distribution of 100 points might have 2 or 3 outliers, but a much larger sample of, say, 100,000 from the same kind of data distribution might then have 2,000 to 3,000 outliers (a short simulation at the end of this section illustrates how outlier counts grow with sample size). The advantages of individual identification are then lost.

Histograms may be regarded either as a crude form of density estimation or as a data analysis tool. Recommendations on bin width or numbers of bins based on various theoretical considerations may be found in Scott [1992]. For instance, for the x variable in the data set of 17,243 cases in Figure 1 his recommendation would be 586 bins! This would permit about 1 pixel width per bin on a laptop screen. Of course, the problem is the extreme outliers, and most of the 586 bins would be empty. Without the top 3 outliers the number of bins recommended goes down to 548, although that is still excessive. Without the top 12 outliers it goes down to 161. Rather than relying on such theoretical models, it makes more sense to experiment.
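To make the bin-count arithmetic concrete, here is a short sketch of Scott's normal-reference rule, which chooses bin width h = 3.49 s n^(-1/3); the number of bins is then the data range divided by h. The data below are synthetic (a heavy-tailed t distribution standing in for the bank's amount variable), so the counts will not reproduce the 586/548/161 figures above, but they show the same mechanism: a handful of extreme outliers inflates the range, and with it the recommended number of bins.

```python
import numpy as np

def scott_bins(x):
    """Number of bins implied by Scott's (1992) rule: h = 3.49 * s * n**(-1/3)."""
    h = 3.49 * np.std(x) * len(x) ** (-1 / 3)
    return int(np.ceil((x.max() - x.min()) / h))

# Synthetic heavy-tailed stand-in for the 17,243-case variable in Figure 1.
rng = np.random.default_rng(0)
x = np.sort(rng.standard_t(df=2, size=17_243) * 1_000)

print(scott_bins(x))        # all cases: a very large bin count
print(scott_bins(x[:-3]))   # without the top 3 outliers: fewer bins
print(scott_bins(x[:-12]))  # without the top 12 outliers: fewer still
```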
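Finally, the earlier point about boxplots can be checked with a few lines of simulation. The distribution here (a t distribution with 3 degrees of freedom) is an assumption standing in for "the same kind of data distribution": the fraction of points flagged by the standard fences at 1.5 times the interquartile range stays roughly constant, so the count of "outliers" grows in proportion to the sample size and identifying them individually becomes pointless.

```python
import numpy as np

def n_boxplot_outliers(x):
    """Count points beyond the usual 1.5 * IQR boxplot fences."""
    q1, q3 = np.percentile(x, [25, 75])
    fence = 1.5 * (q3 - q1)
    return int(np.sum((x < q1 - fence) | (x > q3 + fence)))

rng = np.random.default_rng(1)
for n in (100, 10_000, 100_000):
    x = rng.standard_t(df=3, size=n)  # heavy-tailed sample
    print(n, n_boxplot_outliers(x))   # outlier count scales with n
```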