Helix: Algorithm/Architecture Co-design for Accelerating Nanopore Genome Base-calling

Nanopore genome sequencing is the key to enabling personalized medicine, global food security, and virus surveillance. The state-of-the-art base-callers adopt deep neural networks (DNNs) to translate electrical signals generated by nanopore sequencers to digital DNA symbols. A DNN-based base-caller consumes 44.5% of total execution time of a nanopore sequencing pipeline. However, it is difficult to quantize a base-caller and build a power-efficient processing-in-memory (PIM) to run the quantized base-caller. Although conventional network quantization techniques reduce the computing overhead of a base-caller by replacing floating-point multiply-accumulations by cheaper fixed-point operations, it significantly increases the number of systematic errors that cannot be corrected by read votes. The power density of prior nonvolatile memory (NVM)-based PIMs has already exceeded memory thermal tolerance even with active heat sinks, because their power efficiency is severely limited by analog-to-digital converters (ADC). Finally, Connectionist Temporal Classification (CTC) decoding and read voting cost 53.7% of total execution time in a quantized base-caller, and thus became its new bottleneck. In this paper, we propose a novel algorithm/architecture co-designed PIM, Helix, to power-efficiently and accurately accelerate nanopore base-calling. From algorithm perspective, we present systematic error aware training to minimize the number of systematic errors in a quantized base-caller. From architecture perspective, we propose a low-power SOT-MRAM-based ADC array to process analog-to-digital conversion operations and improve power efficiency of prior DNN PIMs. Moreover, we revised a traditional NVM-based dot-product engine to accelerate CTC decoding operations, and create a SOT-MRAM binary comparator array to process read voting. Compared to state-of-the-art PIMs, Helix improves base-calling throughput by 6x, throughput per Watt by 11.9x and per mm2 by 7.5x without degrading base-calling accuracy.

[1]  T. Endoh,et al.  First demonstration of field-free SOT-MRAM with 0.35 ns write speed and 70 thermal stability under 400°C thermal tolerance by canted SOT structure and its advanced patterning/SOT channel technology , 2019, 2019 IEEE International Electron Devices Meeting (IEDM).

[2]  William J. Dally,et al.  Darwin-WGA: A Co-processor Provides Increased Sensitivity in Whole Genome Alignments with High Speedup , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[3]  Robert J. Fischer,et al.  Nanopore Sequencing as a Rapidly Deployable Ebola Outbreak Tool , 2016, Emerging infectious diseases.

[4]  Lei Jiang,et al.  FindeR: Accelerating FM-Index-Based Exact Pattern Matching in Genomic Sequences through ReRAM Technology , 2019, 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[5]  Chia-Lin Yang,et al.  Sparse ReRAM Engine: Joint Exploration of Activation and Weight Sparsity in Compressed Neural Networks , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[6]  Kang L. Wang,et al.  Low-Power, High-Density Spintronic Programmable Logic With Voltage-Gated Spin Hall Effect in Magnetic Tunnel Junctions , 2016, IEEE Magnetics Letters.

[7]  Yuan Xie,et al.  A Study on Practically Unlimited Endurance of STT-MRAM , 2017, IEEE Transactions on Electron Devices.

[8]  S. Koren,et al.  Nanopore sequencing and assembly of a human genome with ultra-long reads , 2017, bioRxiv.

[9]  Miao Hu,et al.  ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[10]  David A. Patterson,et al.  FPGA Accelerated INDEL Realignment in the Cloud , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[11]  Jeong-Heon Park,et al.  Dependence of Voltage and Size on Write Error Rates in Spin-Transfer Torque Magnetic Random-Access Memory , 2016, IEEE Magnetics Letters.

[12]  Kaushik Roy,et al.  Design of a Low-Voltage Analog-to-Digital Converter Using Voltage-Controlled Stochastic Switching of Low Barrier Nanomagnets , 2018, IEEE Magnetics Letters.

[13]  Tomáš Vinař,et al.  DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads , 2016, PloS one.

[14]  Dmitri B. Strukov,et al.  Race Logic: A hardware acceleration for dynamic programming algorithms , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[15]  Ryan R. Wick,et al.  Performance of neural network basecalling tools for Oxford Nanopore sequencing , 2019, Genome Biology.

[16]  E. Holmes,et al.  Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding , 2020, The Lancet.

[17]  S. Ambrogio,et al.  Reducing the Impact of Phase-Change Memory Conductance Drift on the Inference of large-scale Hardware Neural Networks , 2019, 2019 IEEE International Electron Devices Meeting (IEDM).

[18]  Onur Mutlu,et al.  Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions , 2017, Briefings Bioinform..

[19]  Minh Duc Cao,et al.  Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning , 2017, bioRxiv.

[20]  Mary Wootters,et al.  The N3XT Approach to Energy-Efficient Abundant-Data Computing , 2019, Proceedings of the IEEE.

[21]  Weisheng Zhao,et al.  Heterogeneous Memristive Devices Enabled by Magnetic Tunnel Junction Nanopillars Surrounded by Resistive Silicon Switches , 2017, 1708.00372.

[22]  Awni Hannun,et al.  Sequence Modeling with CTC , 2017 .

[23]  William J. Dally,et al.  Darwin: A Genomics Co-processor Provides up to 15,000X Acceleration on Long Read Assembly , 2018, USENIX Annual Technical Conference.

[24]  Cong Xu,et al.  NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory , 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[25]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[26]  Hao Yan,et al.  CELIA: A Device and Architecture Co-Design Framework for STT-MRAM-Based Deep Learning Acceleration , 2018, ICS.

[27]  Dong Li,et al.  Integrated Thermal Analysis for Processing In Die-Stacking Memory , 2016, MEMSYS.

[28]  Oliver G. Pybus,et al.  Mobile real-time surveillance of Zika virus in Brazil , 2016, Genome Medicine.

[29]  Hongbin Zha,et al.  Alternating Multi-bit Quantization for Recurrent Neural Networks , 2018, ICLR.

[30]  Yuan Xie,et al.  MEDAL: Scalable DIMM based Near Data Processing Accelerator for DNA Seeding Algorithm , 2019, MICRO.

[31]  David Blaauw,et al.  GenAx: A Genome Sequencing Accelerator , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[32]  Yan Wang,et al.  Fully Quantized Network for Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Dejan S. Milojicic,et al.  PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference , 2019, ASPLOS.

[34]  Akino Shiroma,et al.  Advantages of genome sequencing by long-read sequencer using SMRT technology in medical area , 2017, Human Cell.

[35]  Scott A. Mahlke,et al.  In-Memory Data Parallel Processor , 2018, ASPLOS.

[36]  Sachin S. Talathi,et al.  Fixed Point Quantization of Deep Convolutional Networks , 2015, ICML.