Achieving Scalability in a k-NN Multi-GPU Network Service with Centaur

Centaur is a GPU-centric architecture for building a low-latency approximate k-Nearest-Neighbors network server. We implement a multi-GPU distributed data flow runtime that enables efficient and scalable processing of network requests on GPUs. The runtime removes GPU management overheads from the CPU, making server throughput and response time largely insensitive to CPU load, CPU speed, or the number of dedicated CPU cores. Our experiments show that the server achieves near-perfect scaling with 16 GPUs, beating the throughput of a highly optimized CPU-driven server by 35% while maintaining an average request latency of about 2 ms. Furthermore, it requires only a single CPU core to run, and in this single-core setting it achieves over an order of magnitude higher throughput than the standard CPU-driven server architecture.
