The Hierarchical Memory Machine Model for GPUs

The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of shared memory access and global memory access on GPUs, respectively. The main contribution of this paper is to introduce the Hierarchical Memory Machine (HMM), which consists of multiple DMMs and a single UMM. The HMM is a more practical parallel computing model that reflects the architecture of current GPUs. We present several fundamental algorithms on the HMM. First, we show that the sum of n numbers can be computed in O(n/w + nl/p + l + log n) time units using p threads on the HMM with width w and latency l, and prove that this computing time is optimal. We also show that the direct convolution of m and m + n - 1 numbers can be done in O(n/w + mn/dw + nl/p + l + log m) time units using p threads on the HMM with d DMMs, width w, and latency l. Finally, we prove that our implementation of the direct convolution is time-optimal.
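To give a feel for the terms in the summation bound, the following Python sketch evaluates n/w + nl/p + l + log n with constant factors dropped, and simulates one common way to sum with p threads: each thread accumulates a strided slice, then the partial sums are combined pairwise. The function names and the strided access pattern are assumptions made for illustration; the abstract does not describe the paper's actual algorithm, and this sequential simulation does not model memory banks or pipelined latency.

```python
import math


def sum_time_bound(n, p, w, l):
    """Evaluate n/w + n*l/p + l + log2(n), the terms of the stated bound.

    Constant factors hidden by the O-notation are ignored, so the result
    is only useful for comparing parameter settings, not as a real time.
    """
    return n / w + n * l / p + l + math.log2(n)


def strided_tree_sum(values, p):
    """Sum `values` the way p cooperating threads might (hypothetical scheme).

    Phase 1: simulated thread t adds up values[t], values[t + p], ...
    Phase 2: the p partial sums are combined pairwise, one level per round.
    Returns (total, number_of_combining_rounds).
    """
    partial = [sum(values[t::p]) for t in range(p)]
    rounds = 0
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0)  # pad so every element has a partner
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
        rounds += 1
    return partial[0], rounds


if __name__ == "__main__":
    total, rounds = strided_tree_sum(list(range(1, 101)), p=8)
    print(total, rounds)
    print(sum_time_bound(n=1024, p=256, w=32, l=100))
```

In this sketch the combining phase takes about log p rounds, which mirrors the logarithmic term of the bound, while the n/w and nl/p terms correspond to memory bandwidth and latency costs that the simulation leaves out.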
