It is with great pleasure that we introduce this special section on System-Level Design of Reliable Architectures to the audience of the IEEE Transactions on Computers. Six papers have been selected, covering a wide spectrum of topics ranging from architectural fault-tolerance techniques to formal methodologies for reliability analysis. These papers are authored by leading researchers in the field and cover both theoretical and experimental topics.

The widespread use of electronics in everyday life is drawing increasing attention to the reliability of such systems, in order to preserve both user and environmental safety; the design of reliable architectures is therefore a necessity rather than an option today, even in non-critical application domains. At the same time, these systems are reaching high levels of complexity, leading designers both to develop specific components and to compose existing ones to achieve the desired overall functionality. In the former case, ad hoc techniques may be devised, acting on either the hardware or the software to cope with the occurrence of faults. In the latter case, when independently designed modules are combined, the enhancement and assessment of reliability become particularly important; for instance, specific approaches are required both to apply fault detection/tolerance techniques from the initial steps of the design flow and to evaluate the effects of faults in one component as it interacts with the others composing the overall system. As a result, the entire design flow needs to be enhanced to support reliability: from the initial modeling of the system, together with the desired properties/requirements and the fault model, through the hardware/software partitioning step, to the subsequent design exploration phase, where the traditional metrics covering performance, cost, and power consumption need to be extended to also weigh fault detection/tolerance capabilities. Functional verification and reliability analysis constitute two further aspects of this scenario, assessing the quality of the designed system in terms of its correctness and its ability to deal with failures.

In this context, advances have been achieved in all the relevant issues pertaining to the system-level design of reliable systems, supporting designers in the development of innovative architectures able to cope with the occurrence of failures. Such advances lead to the definition of both new methodologies and new architectures. Furthermore, depending on the application environment in which the system will be deployed, different classes of reliability may be necessary: in some situations an autonomous fault detection capability is sufficient, whereas in critical environments fault effects need to be completely masked, thus providing fault tolerance.

The six papers presented in this special section were selected to address the different aspects of the important challenges related to the system-level design of reliable systems. They cover the various facets of the issue, offering interesting solutions to the specific problems. The first two papers deal with reliability analysis, which has become a fundamental tool for computer engineers in the validation of hardened system architectures, in particular in safety- and mission-critical domains such as the medical, military, and transportation domains.
The first paper is entitled “Formal Reliability Analysis Using Theorem Proving” by Osman Hasan, Sofiene Tahar, and Naeem Abbasi. This paper addresses an important aspect of reliability analysis, introducing formal verification in place of simulation-based and probabilistic approaches to assess the fault tolerance characteristics of the designed system. The authors propose to conduct the formal reliability analysis of systems within the framework of a higher-order-logic theorem prover. In this paper, they present the higher-order-logic formalization of some fundamental reliability theory concepts, which can be built upon to precisely analyze the reliability of various engineering systems. The proposed formalization is then applied to analyze the repairability conditions for a reconfigurable memory array in the presence of stuck-at and coupling faults.

Still within the context of reliability analysis, the second paper, entitled “Efficient Microarchitectural Vulnerabilities Prediction Using Boosted Regression Trees and Patient Rule Inductions,” by Bin Li, Lide Duan, and Lu Peng, deals with Architectural Vulnerability Factor (AVF) analysis. The AVF reflects the probability that a transient fault eventually causes a visible error in the program output, and thus indicates a system’s susceptibility to transient faults. This metric is increasingly being adopted to evaluate microprocessor architectures, due to their growing vulnerability to transient faults caused by shrinking feature sizes, lower threshold voltages, and increasing frequencies. The authors propose an innovative way to predict the AVF using Boosted Regression Trees, a nonparametric tree-based predictive modeling scheme, to identify the correlation, across workloads, execution phases, and processor configurations, between the estimated AVF of a key processor structure and various performance metrics.

The next two papers deal with fault detection techniques for different architectural components. The first of these is entitled “Concurrent Structure-Independent Fault Detection Schemes for the Advanced Encryption Standard,” authored by Mehran Mozaffari-Kermani and Arash Reyhani-Masoleh.
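As a brief aside for readers less familiar with the quantities underlying the first two papers, the standard definitions are recalled below in our own illustrative notation; it is not reproduced from the papers themselves. The reliability of a component whose time to failure is a random variable $X$ with cumulative distribution function $F_X$ is

\[
R(t) \;=\; \Pr(X > t) \;=\; 1 - F_X(t),
\]

and the Architectural Vulnerability Factor of a hardware structure holding $B$ bits, observed over $N$ cycles in which $\mathrm{ACE}(n)$ bits are required for architecturally correct execution at cycle $n$, is commonly estimated as

\[
\mathrm{AVF} \;=\; \frac{\sum_{n=1}^{N} \mathrm{ACE}(n)}{B \cdot N}.
\]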