Survey and future directions of fault-tolerant distributed computing on board spacecraft

Abstract: Current and future space missions demand highly reliable on-board computing systems that are also capable of high-performance data processing. At present, no single computing scheme satisfies both the high-reliability requirement and the high-performance requirement. The aim of this paper is to review existing systems and to offer a new approach to the problem. In the first part of the paper, a detailed survey of fault-tolerant distributed computing systems for space applications is presented. Fault types and assessment criteria for fault-tolerant systems are introduced, and redundancy schemes for distributed systems are analyzed. A review of the state of the art in fault-tolerant distributed systems is presented, and the limitations of current approaches are discussed. In the second part of the paper, a new fault-tolerant distributed computing platform with wireless links among the computing nodes is proposed. Novel algorithms that enable important aspects of the architecture, such as time slot priority adaptive fault-tolerant channel access and fault-tolerant distributed computing using task migration, are introduced.
