Coordinating the use of GPU and CPU for improving performance of compute intensive applications

GPUs have recently evolved into very fast parallel co-processors capable of executing general purpose computations extremely efficiently. At the same time, multi-core CPUs evolution continued and today's CPUs have 4-8 cores. These two trends, however, have followed independent paths in the sense that we are aware of very few works that consider both devices cooperating to solve general computations. In this paper we investigate the coordinated use of CPU and GPU to improve efficiency of applications even further than using either device independently. We use Anthill runtime environment, a data-flow oriented framework in which applications are decomposed into a set of event-driven filters, where for each event, the runtime system can use either GPU or CPU for its processing. For evaluation, we use a histopathology application that uses image analysis techniques to classify tumor images for neuroblas-toma prognosis. Our experimental environment includes dual and octa-core machines, augmented with GPUs and we evaluate our approach's performance for standalone and distributed executions. Our experiments show that a pure GPU optimization of the application achieved a factor of 15 to 49 times improvement over the single core CPU version, depending on the versions of the CPUs and GPUs. We also show that the execution can be further reduced by a factor of about 2 by using our runtime system that effectively choreographs the execution to run cooperatively both on GPU and on a single core of CPU. We improve on that by adding more cores, all of which were previously neglected or used ineffectively. In addition, the evaluation on a distributed environment has shown near linear scalability to multiple hosts.

[1]  David Parham,et al.  The Problems and Promise of Central Pathology Review: Development of a Standardized Procedure for the Children's Oncology Group , 2007, Pediatric and developmental pathology : the official journal of the Society for Pediatric Pathology and the Paediatric Pathology Society.

[2]  Debra A. Hensgen,et al.  The relative performance of various mapping algorithms is independent of sizable variances in run-time predictions , 1998, Proceedings Seventh Heterogeneous Computing Workshop (HCW'98).

[3]  Larry L. Peterson,et al.  A dynamic network architecture , 1992, TOCS.

[4]  Lúcia Maria de A. Drummond,et al.  Anthill: a scalable run-time environment for data mining applications , 2005, 17th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'05).

[5]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[6]  Matti A. Hiltunen,et al.  Coyote: a system for constructing fine-grain configurable communication services , 1998, TOCS.

[7]  Jun Kong,et al.  Computer-aided prognosis of neuroblastoma on whole-slide images: Classification of stromal development , 2009, Pattern Recognit..

[8]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[9]  Naga K. Govindaraju,et al.  Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[10]  Matti Pietikäinen,et al.  Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Bingsheng He,et al.  Mars: Accelerating MapReduce with Graphics Processors , 2011, IEEE Transactions on Parallel and Distributed Systems.

[12]  Joel H. Saltz,et al.  DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems , 2000, IEEE Symposium on Mass Storage Systems.

[13]  Hiroyuki Shimada,et al.  Terminology and morphologic criteria of neuroblastic tumors , 1999, Cancer.

[14]  Francine Berman,et al.  Application-Level Scheduling on Distributed Heterogeneous Networks , 1996, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.

[15]  Yoonho Park,et al.  SPC: a distributed, scalable platform for data mining , 2006, DMSSP '06.

[16]  Teresa H. Y. Meng,et al.  Merge: a programming model for heterogeneous multi-core systems , 2008, ASPLOS.

[17]  Wagner Meira,et al.  Achieving Multi-Level Parallelism in the Filter-Labeled Stream Programming Model , 2008, 2008 37th International Conference on Parallel Processing.