Statistical Modeling of Large-Scale Scientific Simulation Data

With the advent of massively parallel computer systems, scientists are now able to simulate complex phenomena (e.g., explosions of stars). Such scientific simulations typically generate large-scale data sets over the spatiotemporal space. Unfortunately, the sheer size of the generated data sets makes efficient exploration of them impossible. Constructing queriable statistical models is an essential step in helping scientists glean new insight from their computer simulations. We define queriable statistical models to be descriptive statistics that (1) summarize and describe the data within a user-defined modeling error, and (2) are able to answer complex range-based queries over the spatiotemporal dimensions. In this chapter, we describe systems that build queriable statistical models for large-scale scientific simulation data sets. In particular, we present our Ad-hoc Queries for Simulation (AQSim) infrastructure, which reduces the data storage requirements and query access times by (1) creating and storing queriable statistical models of the data at multiple resolutions, and (2) evaluating queries on these models of the data instead of the entire data set. Within AQSim, we focus on three simple but effective statistical modeling techniques. AQSim's first modeling technique (called univariate mean modeler) computes the "true" (unbiased) mean of systematic partitions of the data. AQSim's second statistical modeling technique (called univariate goodness-of-fit modeler) uses the Anderson-Darling goodness-of-fit method on systematic partitions of the data. Finally, AQSim's third statistical modeling technique (called multivariate clusterer) utilizes the cosine similarity measure to cluster the data into similar groups. Our experimental evaluations on several scientific simulation data sets illustrate the value of using these statistical models on large-scale simulation data sets.
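As an illustration of what a queriable model "within a user-defined modeling error" might look like, the following is a minimal Python sketch, not AQSim's actual implementation. It assumes one-dimensional data, a simple bisection scheme, and a maximum-deviation error test; the names `Node`, `build`, `range_mean`, and `tol` are hypothetical. Each partition stores only its mean, and a range query is answered from those stored means rather than from the raw data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    lo: int                      # inclusive start index of the partition
    hi: int                      # exclusive end index of the partition
    mean: float                  # mean of data[lo:hi]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(data: Sequence[float], lo: int, hi: int, tol: float) -> Node:
    """Bisect [lo, hi) until every value lies within tol of its partition mean.

    The max-deviation test is an assumption made for this sketch.
    """
    chunk = data[lo:hi]
    mean = sum(chunk) / len(chunk)
    node = Node(lo, hi, mean)
    if hi - lo > 1 and max(abs(x - mean) for x in chunk) > tol:
        mid = (lo + hi) // 2
        node.left = build(data, lo, mid, tol)
        node.right = build(data, mid, hi, tol)
    return node


def range_mean(root: Node, lo: int, hi: int) -> float:
    """Approximate the mean of data[lo:hi] (lo < hi) from stored means only."""

    def weighted_sum(n: Node) -> float:
        # Number of indices this partition shares with the query range.
        overlap = min(n.hi, hi) - max(n.lo, lo)
        if overlap <= 0:
            return 0.0
        if n.left is None:       # leaf: its mean stands in for each value
            return n.mean * overlap
        return weighted_sum(n.left) + weighted_sum(n.right)

    return weighted_sum(root) / (hi - lo)


if __name__ == "__main__":
    data = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
    model = build(data, 0, len(data), tol=0.5)
    print(range_mean(model, 2, 7))
```

Because every value in a leaf lies within `tol` of that leaf's mean, the answer returned by `range_mean` differs from the exact range mean by at most `tol` under these assumptions. A looser tolerance yields fewer partitions, which is the storage-versus-accuracy trade-off the abstract describes.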
