This paper highlights multiple shortcomings in the current design process of cyber-physical embedded systems with real-time constraints. First, shortcomings in current as well as future standards for controlling the power grid are outlined. From these economic and safety threats, we derive an immediate need to invest in research on the protection of the power grid, both from the perspective of cyber attacks and of distributed control system problems. Second, current software design practice does not adequately verify and validate the worst-case timing scenarios that must be guaranteed in order to meet deadlines in safety-critical embedded systems. This applies equally to avionics and the automotive industry, both of which increasingly require their suppliers to provide verifiable bounds on the worst-case execution time of software. Yet there is a lack of viable solutions that suppliers can employ. We provide an analysis of this problem that outlines directions for future research and tool development in this area, both of which are pressing issues. Third, the correctness of embedded systems is currently jeopardized by soft errors that may render control systems inoperable. In general, soft errors are increasingly a problem due to (a) smaller fabrication sizes and (b) deployment in harsh environments. Increasingly, off-the-shelf embedded processors without hardware protection against soft errors are being deployed in airplanes and cars. Meanwhile, system developers have been asked to consider the effect of soft errors in their software design, yet they lack a methodology to do so. We outline much-needed research in this area.

I. SECURITY CONCERNS IN THE POWER GRID

The power grid is a distributed cyber-physical system that is essential to our everyday life. Large-scale blackouts are known to have a severe economic and safety impact, as historical events have shown.
The severity of a power outage's impact on our lives continues to increase as the power distribution grid becomes more standardized and more automated. Current standardization efforts include the forthcoming IEC 61850 protocol, which will eventually replace existing DNP variants and other protocols. The 61850 standard redefines the interaction between substations, which provide power to, e.g., quarters of a city, and control centers, which coordinate power distribution to balance supply and demand. This includes an increasing trend toward substation automation, mainly to increase efficiency and reduce maintenance overhead. However, substation automation creates a potential for power outages should substations become the target of cyber attacks or should a distributed control system malfunction. The effects could be as small as long-lasting blackouts for the regions serviced by a substation, or as large as wide-area blackouts if damage is inflicted in an orchestrated, distributed attack or cascades for technical reasons. Current DNP and future 61850 standards are deployed over regular Ethernet. The long-haul connections to control centers are typically dedicated lines and are hence considered safe from cyber attacks. While this assumption may not be sound, substations themselves are a more likely target, as they are unmanned. Physical access within a substation (or via a local wireless maintenance link at a substation) could allow attackers to affect power devices. Some protection could be provided by current systems, such as encryption at the TCP layer, given that DNP and 61850 traffic is, by and large, layered over TCP in practice. However, some messages have real-time requirements that cannot be guaranteed by TCP. These messages remain extremely vulnerable to attacks, as they cannot easily be encrypted given that their transmission occurs at the link layer in current solutions.
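While encrypting such link-layer traffic is hard under tight latency budgets, per-message authentication is one conceivable lightweight counter-measure. The sketch below appends a truncated HMAC tag to a raw frame payload so a receiver can reject tampered messages; the shared key, tag length, and frame format are illustrative assumptions, not part of the DNP or IEC 61850 standards.

```python
import hmac, hashlib

# Hypothetical pre-shared key; a real deployment would need key management
# between substation devices and control centers.
KEY = b"substation-shared-key"
TAG_LEN = 8  # truncated tag keeps per-frame overhead small

def authenticate(payload: bytes) -> bytes:
    """Append a truncated HMAC-SHA256 tag to a link-layer payload."""
    tag = hmac.new(KEY, payload, hashlib.sha256).digest()[:TAG_LEN]
    return payload + tag

def verify(frame: bytes) -> bool:
    """Recompute the tag over the payload and compare in constant time."""
    payload, tag = frame[:-TAG_LEN], frame[-TAG_LEN:]
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()[:TAG_LEN]
    return hmac.compare_digest(tag, expected)

frame = authenticate(b"TRIP breaker 4")
print(verify(frame))                              # True
print(verify(frame[:-1] + bytes([frame[-1] ^ 1])))  # False: tag bit flipped
```

Authentication alone does not hide message contents, but it would let a device discard forged actuator commands without the retransmission machinery of TCP.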
Another problem is posed by the complexity of distributed systems of substation devices that exchange sensor information and autonomously decide on actuator controls. Certain malfunctions at this level may result in loss of equipment and in the previously described outages. There is an immediate need for research on the protection of critical infrastructure within the power grid to counter cyber-physical attacks and distributed control problems that may result in longer-lasting outages. We currently observe a complete absence of solutions to the problems discussed above. Moreover, no research focuses on these problems to date. One of the main causes is the lack of an adequate simulation infrastructure that would enable academics to contribute viable solutions. Hence, we recommend that a software simulation framework for the IEC 61850 standard be designed at the level of substation devices, their interaction, and their relation to and communication with control centers. This activity should be coordinated with a concerted effort by industry leaders providing valuable input on practicality and requirements. (The U.S. is lagging behind Europe, where the CRISP project has filled this critical gap.) The resulting framework then needs to be complemented by initiatives to support follow-on research on possible cyber attacks and distributed control problems within the power grid at the simulation level, the development of counter-measures at the software level, and their integration into future standards as well as commercial deployment.

II. VERIFICATION AND VALIDATION OF WORST-CASE EXECUTION TIMES

Current software design for safety-critical embedded systems requires stringent compliance with coding standards to ensure safety and reliability. One example is avionics, where the RTCA DO-178B standard requires coverage testing (for statements, branches and conditionals). A very important additional requirement for real-time embedded systems is predictable timing behavior of software components.
In particular, so-called hard real-time embedded systems have timing constraints that must be met or the system may malfunction. Airbus (and likely also Boeing in the near future), e.g., requires its suppliers to provide verifiable bounds on the worst-case execution time (WCET) of software to be deployed on planes currently under development (Airbus 380 and Boeing 787). The automotive industry is currently considering similar requirements, and others are likely to follow. Determining bounds on the WCET of embedded software is a critically important problem for next-generation embedded real-time systems [1]. Currently, practitioners resort to testing methods to determine the execution times of real-time tasks. However, testing alone cannot provide a verifiable (safe) upper bound on the WCET. Exhaustive testing of inputs is generally infeasible, even for moderately complex input spaces, due to its exponential complexity. In contrast to dynamic testing, static timing analysis can provide safe upper bounds on the WCET of code sections, real-time tasks or entire applications. Hence, static timing analysis provides a safer and more efficient alternative to testing [2]. It yields verifiable bounds on the WCET of tasks regardless of program input by simulating execution along the control-flow paths within the program structure while considering architectural details, such as pipelining and caching [3]. These WCET bounds should also be tight to support high utilizations when determining via schedulability analysis whether tasks can meet their deadlines. Tight bounds, however, can only be obtained if the behavior of hardware components is predicted accurately, yet conservatively with respect to their worst-case behavior. Static timing analysis techniques are constantly trailing behind the innovation curve in hardware. It is becoming increasingly difficult to provide tight and safe bounds in the presence of out-of-order execution, dynamic branch prediction and speculative execution.
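The core of static timing analysis described above can be sketched as a longest-path computation over a program's control structure, where every loop carries an externally supplied iteration bound. The tree-shaped program representation and the per-block cycle costs below are hypothetical simplifications; real analyzers additionally model pipelines and caches.

```python
# Minimal sketch of structural WCET analysis: a branch contributes the worst
# case over its alternatives, and a loop multiplies its body's bound by a
# user-supplied maximum iteration count. Block costs (in cycles) are made up.

def wcet(node):
    kind = node[0]
    if kind == "block":    # ("block", cycles)
        return node[1]
    if kind == "seq":      # ("seq", child1, child2, ...)
        return sum(wcet(child) for child in node[1:])
    if kind == "branch":   # ("branch", cond_cost, then_node, else_node)
        return node[1] + max(wcet(node[2]), wcet(node[3]))
    if kind == "loop":     # ("loop", max_iterations, body_node)
        return node[1] * wcet(node[2])
    raise ValueError(f"unknown node kind: {kind}")

# Example: a loop of at most 10 iterations whose body contains a branch.
program = ("seq",
           ("block", 5),
           ("loop", 10,
            ("seq",
             ("branch", 2, ("block", 8), ("block", 3)),
             ("block", 4))))

print(wcet(program))  # 5 + 10 * (2 + max(8, 3) + 4) = 145
```

The bound is safe for any input because every path is subsumed by the worst alternative at each branch, which is exactly why it can be pessimistic; tightening it requires the accurate-yet-conservative hardware models discussed above.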
Simulation of hardware components is also prone to inaccuracy due to a lack of information about subtle details of processors. We advocate research on new approaches to bounding the WCET. Most importantly, a realistic hybrid approach is needed that combines formal static timing analysis with concrete micro-timing observations of actual architectures. First, a formal approach guarantees correctness. Second, dynamic timings on actual processors for small code sections will allow advanced embedded processor designs to be used in such time-critical systems, even in the presence of dynamic and unpredictable execution features. Third, any architectural modifications in support of such a paradigm have to be realistic in that they should reuse existing infrastructure, both on the architecture side and in the methodology for static timing analysis. There is an immediate need to develop software tools that can provide verifiable execution times to allow validation of task schedules within time-critical embedded systems. (The U.S. is lagging behind Europe in transferring research knowledge on WCET into products. However, the European results are also subject to trailing behind the hardware innovation curve, which underlines the need for research.)

III. PROTECTION AGAINST SOFT ERRORS

Transient faults are becoming an increasing concern in system design for two reasons. First, smaller fabrication sizes have resulted in a lower signal-to-noise ratio that more frequently leads to bit flips in CMOS circuits [4]. Second, embedded systems are increasingly deployed in harsh environments, causing soft errors due to a lack of protection on the hardware side [5]. The former reason affects computing at large, while the latter is predominantly of concern for critical infrastructure. For example, the automotive industry has used temperature-hardened processors for control tasks around the engine block, while space missions use radiation-hardened processors to avoid damage from solar radiation.
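Where hardened processors are unavailable, software-only protection typically duplicates computation and compares the results, so that a transient bit flip in one copy is detected. The sketch below illustrates this duplicate-and-compare idea at function granularity; the retry-on-mismatch recovery policy is an illustrative assumption, not a specific published scheme, and real compiler-based approaches work at instruction granularity.

```python
# Sketch of duplicate-and-compare soft-error detection: execute the
# computation twice, accept the result only on agreement, and re-execute on
# disagreement (a transient fault should not recur deterministically).

def protected_call(fn, *args, max_retries=3):
    for _ in range(max_retries):
        first = fn(*args)
        second = fn(*args)   # redundant copy of the computation
        if first == second:  # agreement: accept the result
            return first
    # Persistent disagreement suggests a permanent (hard) fault.
    raise RuntimeError("persistent disagreement: possible hard fault")

# Example usage with a deterministic computation.
result = protected_call(lambda x: x * x + 1, 7)
print(result)  # 50
```

Duplication roughly doubles execution time, which is exactly the tension with the WCET guarantees of Section II: any software fault-tolerance scheme for hard real-time systems must have its overhead bounded as well.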
Current trends indicate an increasing rate of transient faults (i.e., soft errors), not only due to smaller fabrication sizes but also because embedded systems are deployed in harsh environments they were not designed for. In commercial aviation
REFERENCES

[1] Joel R. Sklaroff et al., "Redundancy Management Technique for Space Shuttle Computers," IBM J. Res. Dev., 1976.
[2] Edward J. McCluskey et al., "Concurrent Error Detection Using Watchdog Processors - A Survey," IEEE Trans. Computers, 1988.
[3] David B. Whalley et al., "Bounding worst-case instruction cache performance," Proceedings Real-Time Systems Symposium, 1994.
[4] M. Rimen et al., "Implicit signature checking," Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995.
[5] Y. C. Yeh et al., "Triple-triple redundant 777 primary flight computer," IEEE Aerospace Applications Conference, 1996.
[6] Ying C. Yeh, "Design considerations in Boeing 777 fly-by-wire computers," Proceedings Third IEEE International High-Assurance Systems Engineering Symposium, 1998.
[7] Todd M. Austin et al., "DIVA: a reliable substrate for deep submicron microarchitecture design," MICRO-32: 32nd Annual ACM/IEEE International Symposium on Microarchitecture, 1999.
[8] Edward J. McCluskey et al., "Software-implemented EDAC protection against SEUs," IEEE Trans. Reliab., 2000.
[9] Shubhendu S. Mukherjee et al., "Transient fault detection via simultaneous multithreading," 27th International Symposium on Computer Architecture, 2000.
[10] "Dual use of superscalar datapath for transient-fault detection and recovery," MICRO, 2001.
[11] Edward J. McCluskey et al., "Error detection by duplicated instructions in super-scalar processors," IEEE Trans. Reliab., 2002.
[12] John P. Hayes et al., "Low-cost on-line fault detection using control flow assertions," 9th IEEE On-Line Testing Symposium, 2003.
[13] Cristian Constantinescu et al., "Trends and Challenges in VLSI Circuit Reliability," IEEE Micro, 2003.
[14] Irith Pomeranz et al., "Transient-Fault Recovery for Chip Multiprocessors," IEEE Micro, 2003.
[15] H. Ando et al., "A 1.3GHz fifth generation SPARC64 microprocessor," Design Automation Conference, 2003.
[16] Frank Müller et al., "Timing Analysis for Instruction Caches," Real-Time Systems, 2000.
[17] Joachim Wegener et al., "A Comparison of Static Analysis and Evolutionary Testing for the Verification of Timing Constraints," Real-Time Systems, 2004.
[18] M. Kandemir et al., "Using loop invariants to fight soft errors in data caches," Asia and South Pacific Design Automation Conference, 2005.
[19] David I. August et al., "Design and evaluation of hybrid fault-detection systems," 32nd International Symposium on Computer Architecture, 2005.
[20] David I. August et al., "SWIFT: software implemented fault tolerance," International Symposium on Code Generation and Optimization, 2005.
[21] Mahmut T. Kandemir et al., "Compiler-directed instruction duplication for soft error detection," Design, Automation and Test in Europe, 2005.
[22] Wei Zhang et al., "Compiler-guided register reliability improvement against soft errors," EMSOFT, 2005.
[23] Mahmut T. Kandemir et al., "Memory Space Conscious Loop Iteration Duplication for Reliable Execution," SAS, 2005.
[24] Narayanan Vijaykrishnan et al., "Reliability concerns in embedded system designs," Computer, 2006.