Using approximations to scale exploratory data analysis in datacubes

Exploratory Data Analysis is a widely used technique to determine which factors have the most influence on data values in a multi-way table, or which cells in the table can be considered anomalous with respect to the other cells. In particular, median polish is a simple yet robust method for performing Exploratory Data Analysis. Median polish is resistant to holes in the table (cells that have no values), but it may require many iterations through the data. This makes it difficult to apply median polish to large multidimensional tables, since the I/O requirements may be prohibitive. This paper describes a technique that applies median polish to an approximation of a datacube, easing the I/O burden. The quality of the results is tested using a variety of measures. The technique scales to large datacubes and gives a good approximation of the results that median polish would have obtained on the original data.
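To make the iterative nature of median polish concrete, the following is a minimal sketch of the standard two-way procedure on a small in-memory table. It is not the approximate-datacube technique the paper proposes. The function name `median_polish`, its parameters `max_iter` and `tol`, and the use of `None` to mark a hole are assumptions made for illustration only.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Two-way median polish (illustrative sketch).

    `table` is a list of equal-length rows; None marks a hole (empty cell).
    Returns (overall, row_effects, col_effects, residuals) such that every
    present cell satisfies: value = overall + row + column + residual.
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    def med(values):
        # Median over present cells only; holes are simply skipped.
        present = [v for v in values if v is not None]
        return median(present) if present else 0.0

    for _ in range(max_iter):
        change = 0.0
        # Row sweep: move each row's median out of the residuals.
        for i in range(n_rows):
            m = med(resid[i])
            row_eff[i] += m
            change += abs(m)
            resid[i] = [None if v is None else v - m for v in resid[i]]
        m = med(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Column sweep: move each column's median out of the residuals.
        for j in range(n_cols):
            m = med(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            change += abs(m)
            for i in range(n_rows):
                if resid[i][j] is not None:
                    resid[i][j] -= m
        m = med(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        if change <= tol:  # a full pass changed nothing: converged
            break
    return overall, row_eff, col_eff, resid


# Example: a 3x3 table with one hole and one unusually large cell.
example = [
    [7, 9, 11],
    [8, 60, None],
    [9, 11, 13],
]
overall, rows, cols, residuals = median_polish(example)
```

Each iteration sweeps every row and then every column of the table. On a large multidimensional table stored on disk, each such sweep is a full pass over the data, which is the I/O cost the abstract refers to. Cells whose residuals remain large after the fit are the candidates for anomalies.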
