Towards functional safety compliance of matrix-matrix multiplication for machine learning-based autonomous systems
