Jianyu Huang | Ping Tak Peter Tang | Sihuan Li | Jongsoo Park | Harish Dattatraya Dixit | Daya Khudia | Zizhong Chen
[1] Zizhong Chen, et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods, 2013, PPoPP '13.
[2] Stephen W. Keckler, et al. Making Convolutions Resilient Via Algorithm-Based Error Detection Techniques, 2020, IEEE Transactions on Dependable and Secure Computing.
[3] Dingwen Tao, et al. Silent Data Corruption Resilient Two-sided Matrix Factorizations, 2017, PPoPP.
[4] Jiyan Yang, et al. Post-Training 4-bit Quantization on Embedding Tables, 2019, ArXiv.
[5] David I. August, et al. SWIFT: software implemented fault tolerance, 2005, International Symposium on Code Generation and Optimization.
[6] Dingwen Tao, et al. Correcting soft errors online in fast fourier transform, 2017, SC.
[7] Debjit Das Sarma, et al. Compute Solution for Tesla's Full Self-Driving Computer, 2020, IEEE Micro.
[8] Franck Cappello, et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[9] Kartheek Rangineni, et al. ThUnderVolt: Enabling Aggressive Voltage Underscaling and Timing Error Resilience for Energy Efficient Deep Learning Accelerators, 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).
[10] Guanpeng Li, et al. Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] Luigi Carro, et al. Analyzing and Increasing the Reliability of Convolutional Neural Networks on GPUs, 2019, IEEE Transactions on Reliability.
[12] Bo Chen, et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[13] Bin Nie, et al. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities, 2017, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).
[14] Mikhail Smelyanskiy, et al. FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference, 2021, ArXiv.
[15] Robert Baumann, et al. Soft errors in advanced computer systems, 2005, IEEE Design & Test of Computers.
[16] Al Geist, et al. Supercomputing's monster in the closet, 2016, IEEE Spectrum.
[17] Jingyuan Zhang, et al. AIBox: CTR Prediction Model Training on a Single Node, 2019, CIKM.
[18] Kai Zhao, et al. Fault Tolerant One-sided Matrix Decompositions on Heterogeneous Systems with GPUs, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[19] Shuaiwen Song, et al. New-Sum: A Novel Online ABFT Scheme For General Iterative Methods, 2016, HPDC.
[20] Jacob A. Abraham, et al. Algorithm-Based Fault Tolerance for Matrix Operations, 1984, IEEE Transactions on Computers.
[21] Ping Li, et al. Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems, 2020, MLSys.
[22] Kai Ren, et al. Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[23] Sriram Sankar, et al. Silent Data Corruptions at Scale, 2021, ArXiv.