Paper information - Optimizing matrix transposes using a POWER7 cache model and explicit prefetching


We consider the problem of efficiently computing matrix transposes on the POWER7 architecture. We develop a matrix transpose algorithm that uses cache blocking, cache prefetching and data alignment. We model the POWER7 data cache and memory concurrency and use the model to predict the memory throughput of the proposed matrix transpose algorithm. The performance of our matrix transpose algorithm is up to five times higher than that of the dgetmo routine of the Engineering and Scientific Subroutine Library and is 2.5 times higher than that of the code generated by compiler-inserted prefetching. Numerical experiments indicate a good agreement between the predicted and the measured memory throughput.
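As a rough illustration of the cache blocking and explicit prefetching mentioned above, here is a minimal sketch in C. It is not the authors' algorithm: the tile size `TILE`, the function name `transpose_blocked`, and the prefetch distance (one tile row ahead) are all assumptions chosen for illustration, and `__builtin_prefetch` is the GCC/Clang builtin rather than a POWER7-specific instruction. Data alignment and the cache model are not shown.

```c
#include <stddef.h>

/* Hypothetical tile edge length; the paper's tuned block size is not given here. */
#define TILE 32

/*
 * Out-of-place transpose of a row-major rows x cols matrix `a`
 * into a row-major cols x rows matrix `b`.
 * The matrix is processed tile by tile so that the working set of
 * each tile can stay resident in cache (cache blocking).
 */
static void transpose_blocked(const double *a, double *b,
                              size_t rows, size_t cols)
{
    for (size_t ii = 0; ii < rows; ii += TILE) {
        size_t i_end = (ii + TILE < rows) ? ii + TILE : rows;
        for (size_t jj = 0; jj < cols; jj += TILE) {
            size_t j_end = (jj + TILE < cols) ? jj + TILE : cols;
            for (size_t i = ii; i < i_end; ++i) {
#if defined(__GNUC__) || defined(__clang__)
                /* Assumed prefetch policy: request the next source row
                   of this tile for reading before it is needed. */
                if (i + 1 < i_end)
                    __builtin_prefetch(&a[(i + 1) * cols + jj], 0, 0);
#endif
                /* Copy one tile row of `a` into one tile column of `b`. */
                for (size_t j = jj; j < j_end; ++j)
                    b[j * rows + i] = a[i * cols + j];
            }
        }
    }
}
```

Calling `transpose_blocked(a, b, rows, cols)` with `b` sized for `cols * rows` doubles fills `b` with the transpose of `a`. In practice the tile size and prefetch distance would need tuning against the target cache hierarchy, which is the role the paper assigns to its POWER7 cache model.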

[1] David A. Patterson, et al. Computer Architecture, Fifth Edition: A Quantitative Approach, 2011.

[2] David A. Patterson, et al. Computer Architecture: A Quantitative Approach, 1990.

[3] Balaram Sinharoy, et al. IBM POWER7 multicore server processor, 2011.

[4] David A. Patterson, et al. Computer Architecture - A Quantitative Approach (4. ed.), 2007.

[5] John McCalpin, et al. Automatic benchmark generation for cache optimization of matrix operations, 1995, ACM-SE 33.

[6] Siddhartha Chatterjee, et al. Cache-efficient matrix transposition, 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture (HPCA-6).

[7] Charles E. Leiserson, et al. Cache-Oblivious Algorithms, 2003, CIAC.

[8] Ramakrishnan Rajamony, et al. PERCS: The IBM POWER7-IH high-performance computing system, 2011, IBM J. Res. Dev.

[9] Balaram Sinharoy, et al. POWER7: IBM's next generation server processor, 2010, 2009 IEEE Hot Chips 21 Symposium (HCS).