From Large to Huge: A Statistician’s Reactions to KDD & DM 1

The three distinct data handling cultures (statistics, data base management and artificial intelligence) finally show signs of convergence. Whether you name their common area “data analysis” or “knowledge discovery”, the necessary ingredients for success with ever larger data sets are identical: good data, subject area expertise, access to technical know-how in all three cultures, and a good portion of common sense. Curiously, all three cultures have been trying to avoid common sense and hide its lack behind a smoke-screen of technical formalism. Huge data sets usually are not just more of the same; they have to be huge because they are heterogeneous, with more internal structure, such that smaller sets would not do. As a consequence, subsamples and techniques based on them, like the bootstrap, may no longer make sense. The complexity of the data regularly forces the data analyst to fashion simple, but problem- and data-specific tools from basic building blocks, taken from data base management and numerical mathematics. Scaling-up of algorithms is problematic: the computational complexity of many procedures explodes with increasing data size; for example, conventional clustering algorithms become unfeasible. The human ability to inspect a data set, or even only a meaningful part of it, breaks down far below terabyte sizes. I believe that attempts to circumvent this by “automating” some aspects of exploratory analysis are futile. The available success stories suggest that the real function of data mining and KDD is not machine discovery of interesting structures by itself, but targeted extraction and reduction of data to a size and format suitable for human inspection. By necessity, such preprocessing is ad hoc, data specific and driven by working hypotheses based on subject matter expertise and on trial and error.
Statistical common sense (which traps to avoid, how to handle random and systematic errors, and where to stop) is more important than specific techniques. The machine assistance we need to step from large to huge sets thus is an integrated computing environment that allows easy improvisation and retooling even with massive data.

1. Copyright © 1997, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Knowledge Discovery in Databases (KDD) and Data Analysis (DA) share a common goal, namely to extract meaning from data. The only discernible difference is that the former commonly is regarded as machine centered, the latter as centered on statistical techniques and probability. But there are signs of convergence towards a common, human-centered view on both sides. Note the comment by Brachman and Anand (1996, p. 38): “Overall, then, we see a clear need for more emphasis on a human-centered, process-oriented analysis of KDD”. One is curiously reminded of Tukey’s (1962) plea, emphasizing the role of human judgment over that of mathematical proof in DA. It seems that in different periods each professional group has been trying to squeeze out human common sense and to hide its lack behind a smoke screen of its own technical formalism. The statistics community appears to be further along on the way: a majority now has acquiesced to the idea that DA ought to be a human-centered process, and I hope the AI community will follow suit towards a happy reunion of resources.