Fault-Tolerance of Parallel Volume Rendering on Cluster of PCs

In this paper we address an important issue in parallel rendering systems: reliability. Distributed systems, such as clusters of PCs, are low-cost alternatives for running parallel rendering systems. However, distributed systems are usually not reliable: machines can fail during the rendering process, resulting in incomplete final images. Our goal is therefore to exploit specific features of parallel rendering applications, such as tile-based computation, to build mechanisms that dynamically detect machine failures and automatically recover the tasks of a failed machine, with low overhead and no extra hardware. We developed three different parallel rendering systems, all based on the Parallel ZSweep algorithm [5], that provide fault tolerance in different ways. Our experimental results show that the three systems incur a small overhead to detect failures and that, when a failure occurs, redistributing the work does not degrade system performance. We conclude that it is possible to provide fault tolerance at low cost in a cluster of PCs.
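To make the idea of tile-based failure recovery concrete, the following is a minimal sketch of how a master node might track tiles and reassign those held by a machine that stops responding. It is not the implementation of any of the three systems described above: the class `TileScheduler`, the heartbeat-timeout detection rule, and all method names are assumptions introduced purely for illustration.

```python
from collections import deque


class TileScheduler:
    """Hypothetical master-side bookkeeping for tile-based rendering."""

    def __init__(self, num_tiles, timeout):
        self.pending = deque(range(num_tiles))  # tiles not yet assigned
        self.in_flight = {}                     # tile id -> worker id
        self.done = set()                       # tiles whose results arrived
        self.last_seen = {}                     # worker id -> last heartbeat time
        self.timeout = timeout                  # assumed failure threshold (seconds)

    def heartbeat(self, worker, now):
        self.last_seen[worker] = now

    def request_tile(self, worker, now):
        """Hand the next pending tile to a worker, or None if none is left."""
        self.heartbeat(worker, now)
        if not self.pending:
            return None
        tile = self.pending.popleft()
        self.in_flight[tile] = worker
        return tile

    def complete(self, worker, tile, now):
        self.heartbeat(worker, now)
        # Ignore late results for tiles that were already reassigned.
        if self.in_flight.get(tile) == worker:
            del self.in_flight[tile]
            self.done.add(tile)

    def detect_failures(self, now):
        """Declare silent workers failed and requeue their unfinished tiles."""
        failed = [w for w, t in self.last_seen.items() if now - t > self.timeout]
        for w in failed:
            del self.last_seen[w]
            lost = sorted(t for t, owner in self.in_flight.items() if owner == w)
            for t in lost:
                del self.in_flight[t]
                self.pending.append(t)
        return failed

    def finished(self):
        return not self.pending and not self.in_flight


def demo():
    sched = TileScheduler(num_tiles=4, timeout=2.0)
    a = sched.request_tile("A", now=0.0)
    sched.request_tile("B", now=0.0)       # worker B takes a tile, then goes silent
    sched.complete("A", a, now=1.0)
    a = sched.request_tile("A", now=1.0)
    sched.complete("A", a, now=2.0)
    failed = sched.detect_failures(now=3.0)
    now = 3.0
    while not sched.finished():            # worker A drains the remaining tiles
        tile = sched.request_tile("A", now)
        now += 1.0
        sched.complete("A", tile, now)
    return failed, sorted(sched.done)
```

In this sketch, a failure costs only the tiles that were in flight on the failed machine, because completed tiles are never recomputed. That is one plausible reason why tile-based decomposition lends itself to low-overhead recovery.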

[1] Evan Marcus, et al. Blueprints for high availability, 2000.

[2] Cláudio T. Silva, et al. Out-of-core sort-first parallel rendering for cluster-based tiled displays, 2002, Parallel Comput.

[3] Kwan-Liu Ma, et al. Multi-threaded Rendering of Unstructured-Grid Volume Data on the SGI Origin 2000, 2000.

[4] Arie E. Kaufman, et al. Accelerated ray-casting for curvilinear volumes, 1998, Proceedings Visualization '98 (Cat. No.98CB36276).

[5] Kwan-Liu Ma, et al. A PC cluster system for simultaneous interactive volumetric modeling and visualization, 2003, IEEE Symposium on Parallel and Large-Data Visualization and Graphics, 2003. PVG 2003.

[6] Richard P. Martin, et al. Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Services, 2003, USENIX Symposium on Internet Technologies and Systems.

[7] Thomas A. Funkhouser, et al. Load balancing for multi-projector rendering systems, 1999, Workshop on Graphics Hardware.

[8] Ravishankar K. Iyer, et al. Chameleon: A Software Infrastructure for Adaptive Fault Tolerance, 1999, IEEE Trans. Parallel Distributed Syst.

[9] Jack Dongarra, et al. MPI: The Complete Reference, 1996.

[10] Marc Levoy, et al. Volume rendering on scalable shared-memory MIMD architectures, 1992, VVS.

[11] Santosh K. Shrivastava, et al. Using application specific knowledge for configuring object replicas, 1996, Proceedings of International Conference on Configurable Distributed Systems.

[12] Kwan-Liu Ma, et al. Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures, 1995, PRS.

[13] Steven Tuecke, et al. The Anatomy of the Grid, 2003.

[14] Cláudio T. Silva, et al. Parallelizing the ZSWEEP Algorithm for Distributed-Shared Memory Architectures (ST), 2001, VG.

[15] J. Challinger. Scalable parallel volume raycasting for nonrectilinear computational grids, 1993, Proceedings of 1993 IEEE Parallel Rendering Symposium.

[16] Ian T. Foster, et al. The anatomy of the grid: enabling scalable virtual organizations, 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[17] Keith Marzullo, et al. Open Grid: A User-Centric Approach for Grid Computing, 2001.

[18] Ricardo Farias, et al. ZSWEEP: An Efficient and Exact Projection Algorithm for Unstructured Volume Rendering, 2000, 2000 IEEE Symposium on Volume Visualization (VV 2000).

[19] Hermann Kopetz, et al. Distributed fault-tolerant real-time systems: the Mars approach, 1989, IEEE Micro.