Processing large-scale multi-dimensional data in parallel and distributed environments

Analysis of data is an important step in understanding and solving a scientific problem. Analysis involves extracting the data of interest from all the available raw data in a dataset and processing it into a data product. However, in many areas of science and engineering, a scientist's ability to analyze information is increasingly hindered by dataset sizes. The sheer volume of data in scientific datasets makes it difficult to access the data of interest efficiently and to manage potentially heterogeneous system resources to process it. Subsetting and aggregation are common operations in a wide range of data-intensive applications. We argue that common runtime and programming support can be developed for applications that query and manipulate large datasets. This paper presents a compendium of frameworks and methods we have developed to support efficient execution of subsetting and aggregation operations in applications that query and manipulate large, multi-dimensional datasets in parallel and distributed computing environments.


[33]  Ron Oldfield,et al.  Armada: a parallel file system for computational grids , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[34]  Joel H. Saltz,et al.  Decision Tree Construction for Data Mining on Cluster of Shared-Memory Multiprocessors , 2001 .

[35]  David Kotz,et al.  The galley parallel file system , 1996, ICS '96.

[36]  Dror G. Feitelson,et al.  The Vesta parallel file system , 1996, TOCS.

[37]  R. Luettich,et al.  ADCIRC: An Advanced Three-Dimensional Circulation Model for Shelves, Coasts, and Estuaries. Report 6. Development of a Tidal Constituent Database for the Eastern North Pacific. , 1994 .

[38]  Michael E. Papka,et al.  Large-Scale Data Visualization Using Parallel Data Streaming , 2001, IEEE Computer Graphics and Applications.

[39]  Gregory R. Ganger,et al.  Dynamic Function Placement in Active Storage Clusters , 1999 .

[40]  Geoffrey C. Fox,et al.  Runtime Support and Compilation Methods for User-Specified Irregular Data Distributions , 1995, IEEE Trans. Parallel Distributed Syst..

[41]  Joel H. Saltz,et al.  Performance optimization for data intensive grid applications , 2001, Proceedings Third Annual International Workshop on Active Middleware Services.

[42]  William E. Lorensen,et al.  Marching cubes: a high resolution 3D surface construction algorithm , 1996 .

[43]  Rakesh Agrawal,et al.  SPRINT: A Scalable Parallel Classifier for Data Mining , 1996, VLDB.

[44]  Kwan-Liu Ma,et al.  Out-of-Core Streamline Visualization on Large Unstructured Meshes , 1997, IEEE Trans. Vis. Comput. Graph..

[45]  William E. Lorensen,et al.  The visualization toolkit (2nd ed.): an object-oriented approach to 3D graphics , 1998 .

[46]  David Kotz,et al.  Disk-directed I/O for MIMD multiprocessors , 1994, OSDI '94.

[47]  Joel H. Saltz,et al.  Optimizing execution of component-based applications using group instances , 2002, Future Gener. Comput. Syst..

[48]  Mary F. Wheeler,et al.  Parallel computing in environment and energy , 2003 .

[49]  Joel H. Saltz,et al.  Digital dynamic telepathology-the Virtual Microscope , 1998, AMIA.

[50]  Curtis E. A. Karnow,et al.  The Grid: Blueprint for a New Computing Infrastructure ed. by Ian Foster and Carl Kesselman (review) , 2017 .

[51]  Joel H. Saltz,et al.  Evaluation of active disks for decision support databases , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[52]  AykanatCevdet,et al.  Hypergraph-Partitioning-Based Decomposition for Parallel Sparse-Matrix Vector Multiplication , 1999 .

[53]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[54]  Christos Faloutsos,et al.  Declustering using fractals , 1993, [1993] Proceedings of the Second International Conference on Parallel and Distributed Information Systems.

[55]  V. Pascucci,et al.  Parallel accelerated isocontouring for out-of-core visualization , 1999, Proceedings 1999 IEEE Parallel Visualization and Graphics Symposium (Cat. No.99EX381).

[56]  Paul H. Smith,et al.  Data and Visualization Corridors: Report on the 1998 DVC Workshop Series , 1998 .

[57]  Joel H. Saltz,et al.  Performance impact of proxies in data intensive client-server applications , 1999, ICS '99.

[58]  Mohammed J. Zaki,et al.  Parallel Classi cation for Data Mining on Shared-Memory Multiprocessors , 1998 .

[59]  Mary W. Hall,et al.  Detecting Coarse - Grain Parallelism Using an Interprocedural Parallelizing Compiler , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[60]  Ümit V. Çatalyürek,et al.  Hypergraph-Partitioning-Based Decomposition for Parallel Sparse-Matrix Vector Multiplication , 1999, IEEE Trans. Parallel Distributed Syst..

[61]  Ami Marowka,et al.  The GRID: Blueprint for a New Computing Infrastructure , 2000, Parallel Distributed Comput. Pract..

[62]  Joel H. Saltz,et al.  A Performance Prediction Framework for Data Intensive Applications on Large Scale Parallel Machines , 1998, LCR.

[63]  Peter Brezany,et al.  Parallelization of Irregular Codes Including Out-of-Core Data and Index Arrays , 1997, PARCO.

[64]  Joel H. Saltz,et al.  Scalability Analysis of Declustering Methods for Multidimensional Range Queries , 1998, IEEE Trans. Knowl. Data Eng..

[65]  Joel H. Saltz,et al.  A Hypergraph-Based Workload Partitioning Strategy for Parallel Data Aggregation , 2001, PPSC.

[66]  T. Cole,et al.  User's guide to the CE-QUAL-ICM three-dimensional eutrophication model : release version 1.0 , 1995 .

[67]  Joel H. Saltz,et al.  Optimizing retrieval and processing of multi-dimensional scientific datasets , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.