A polyphase filter for many-core architectures

In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards, on dual Intel Xeon CPUs, and on the Intel Xeon Phi (Knights Corner) platform. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of the L1/texture cache, and the second uses shared memory. We discuss the usability of each of our implementations along with their behaviours. We measure performance in execution time, which is a critical factor for real-time systems, and we also present results in terms of bandwidth (GB/s), compute (GFlop/s) and type conversions (GTc/s). We also present our results in terms of the sample rate that a chosen platform can process in real time, which more intuitively describes the expected performance in a signal processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.47x to 1.95x greater than our CPU implementation, but that this is not sufficient to compete with the performance of GPUs. We conclude with a comparison of our best performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data.
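To make the data reuse mentioned above concrete, the following is a minimal pure-Python sketch of a textbook polyphase filter bank: a FIR stage followed by a discrete Fourier transform across channels. It is not the authors' GPU, CPU or Xeon Phi code. The function names (`polyphase_fir`, `dft`, `polyphase_filter_bank`), the coefficient layout (`coeffs[t * channels + c]`) and the naive DFT are all assumptions chosen for illustration.

```python
import cmath


def polyphase_fir(samples, coeffs, channels, taps):
    """FIR stage of a polyphase filter bank (illustrative sketch only).

    Assumed layout: coeffs[t * channels + c] is the weight for tap t and
    channel c. Output spectrum s, channel c is the weighted sum over taps
    of samples[(s + t) * channels + c].
    """
    if len(coeffs) != channels * taps:
        raise ValueError("expected channels * taps coefficients")
    # Each output spectrum needs `taps` consecutive blocks of `channels` samples.
    n_spectra = len(samples) // channels - taps + 1
    output = []
    for s in range(n_spectra):
        row = []
        for c in range(channels):
            acc = 0j
            for t in range(taps):
                # The same input sample is read again by up to `taps`
                # neighbouring spectra; this is the data reuse that a cache
                # or shared-memory scheme can exploit.
                acc += coeffs[t * channels + c] * samples[(s + t) * channels + c]
            row.append(acc)
        output.append(row)
    return output


def dft(row):
    """Naive O(n^2) DFT, standing in for the FFT used in practice."""
    n = len(row)
    return [
        sum(row[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]


def polyphase_filter_bank(samples, coeffs, channels, taps):
    """Apply the FIR stage, then transform each filtered block across channels."""
    return [dft(row) for row in polyphase_fir(samples, coeffs, channels, taps)]
```

In this sketch every input sample contributes to as many as `taps` output spectra, so a naive implementation reloads it that many times from memory. Holding the overlapping blocks in a fast on-chip store is one way to avoid the repeated loads.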
