Write-Avoiding Algorithms

Communication, i.e., moving data between levels of a memory hierarchy or between processors over a network, is much more expensive (in time or energy) than arithmetic. There has therefore been a recent focus on designing algorithms that minimize communication and, when possible, attain lower bounds on the total number of reads and writes. However, most prior work does not distinguish between the costs of reads and writes. In some current and emerging storage technologies, such as nonvolatile memories, writes can be much more expensive than reads. This motivates us to ask whether there are lower bounds on the number of writes that certain algorithms must perform, and whether these bounds are asymptotically smaller than the bounds on the sum of reads and writes. When such smaller lower bounds exist, we ask when they are attainable; we call algorithms that attain them "write-avoiding" (WA), to distinguish them from "communication-avoiding" (CA) algorithms, which only minimize the sum of reads and writes. We examine a number of cases in linear algebra and direct N-body methods and determine which known CA algorithms are also WA (some are and some are not). We also identify classes of algorithms, including Strassen's matrix multiplication, the Cooley-Tukey FFT, and cache-oblivious algorithms for classical linear algebra, for which a WA algorithm cannot exist: the number of writes is unavoidably within a constant factor of the total number of reads and writes. We explore the interaction of WA algorithms with cache replacement policies and argue that the Least Recently Used (LRU) policy works well with the WA algorithms in this paper. We provide empirical hardware-counter measurements on Intel's Nehalem-EX microarchitecture to validate our theory. In the parallel case, for classical linear algebra, we show that it is impossible to attain lower bounds on both interprocessor communication and writes to local memory simultaneously, although either one is attainable by itself. Finally, we discuss WA algorithms for sparse iterative linear algebra.
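As a concrete illustration of the distinction (a minimal sketch, not code from the paper): in classical tiled matrix multiplication, reads from slow memory grow as Theta(n^3 / sqrt(M)) for fast-memory size M, but if each block of the output C is accumulated in fast memory and written back exactly once, the total number of writes to slow memory is only n^2. The C sketch below assumes a hypothetical block size B chosen so that three BxB blocks fit in fast memory; the local buffer Cblk simply stands in for fast memory.

```c
/* Hypothetical illustration of a write-avoiding tiled matrix multiply:
 * each block of C is accumulated in a small local buffer (standing in
 * for fast memory) and written back to C exactly once, so writes total
 * n^2 words while reads of A and B scale as Theta(n^3 / sqrt(M)). */
#include <stdio.h>
#include <string.h>

#define N 8   /* matrix dimension (assumed divisible by B) */
#define B 4   /* block size, chosen so 3*B*B words fit in "fast memory" */

static void wa_matmul(const double A[N][N], const double Bm[N][N],
                      double C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B) {
            double Cblk[B][B];                  /* block of C in "fast memory" */
            memset(Cblk, 0, sizeof Cblk);
            for (int kk = 0; kk < N; kk += B)   /* stream blocks of A and B in */
                for (int i = 0; i < B; i++)
                    for (int k = 0; k < B; k++)
                        for (int j = 0; j < B; j++)
                            Cblk[i][j] += A[ii + i][kk + k] * Bm[kk + k][jj + j];
            for (int i = 0; i < B; i++)         /* single write-back per block */
                for (int j = 0; j < B; j++)
                    C[ii + i][jj + j] = Cblk[i][j];
        }
}

int main(void) {
    double A[N][N], Bm[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)(i + j);
            Bm[i][j] = (i == j) ? 1.0 : 0.0;    /* identity, so C should equal A */
        }
    wa_matmul(A, Bm, C);
    printf("C[0][0] = %g, C[N-1][N-1] = %g\n", C[0][0], C[N - 1][N - 1]);
    return 0;
}
```

In this sketch only the final write-back loop touches C in slow memory, which is what makes the write count asymptotically smaller than the combined read-and-write count; by contrast, for Strassen-like and cache-oblivious algorithms the abstract notes that no such separation is possible.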
