Paper Information - RAS Modeling of a Large InfiniBand Switch System

RAS Modeling of a Large InfiniBand Switch System

Computer clusters or grids constructed from open and standard commercial off-the-shelf (COTS) systems now dominate the top 500 supercomputer sites (Top500, 2008), providing an attractive way to rapidly construct high performance computing (HPC) systems of interconnected nodes. The largest of these HPC systems are now driving toward petascale deployments, delivering petaflops of computational capacity and petabytes of storage capacity. However, designing and building these large HPC systems involves significant challenges, including: rapidly building and expanding the computational capacity of HPC clusters to meet growing demands; increasing levels of computational density while staying within constrained envelopes of power and cooling; reducing complexity and cost for physical infrastructure and management; and implementing interconnect technology that can connect hundreds or thousands of processors without introducing unacceptable levels of latency.

Dong Tang | Ola Torudbakken

[1] Dong Tang, et al. Automatic generation of availability models in RAScad, 2002, Proceedings International Conference on Dependable Systems and Networks.

[2] Kishor S. Trivedi, et al. Hierarchical computation of interval availability and related metrics, 2004, International Conference on Dependable Systems and Networks, 2004.

[3] Dong Tang, et al. Optimizing service strategy for systems with deferred repair, 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).

[4] Liang Yin, et al. Hierarchical composition and aggregation of state-based availability and performability models, 2003, IEEE Trans. Reliab.