Exploiting vector instructions with generalized stream fusion

Stream fusion is a powerful technique for automatically transforming high-level sequence-processing functions into efficient implementations. It has been used to great effect in Haskell libraries for manipulating byte arrays, Unicode text, and unboxed vectors. However, some operations, like vector append, still do not perform well within the standard stream fusion framework. Others, like SIMD computation using the SSE and AVX instructions available on modern x86 chips, do not seem to fit in the framework at all. In this paper we introduce generalized stream fusion, which solves these issues. The key insight is to bundle together multiple stream representations, each tuned for a particular class of stream consumer. We also describe a stream representation suited for efficient computation with SSE instructions. Our ideas are implemented in modified versions of GHC and the vector library. Benchmarks show that high-level Haskell code written using our compiler and libraries can be compiled to code that is faster than both compiler-vectorized and hand-vectorized C.
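The bundling idea can be illustrated with a small sketch. This is not the paper's implementation or the vector library's API: the names `Bundle`, `bElems`, `bChunks`, `fromList`, `mapB`, `sumElems`, and `sumChunks` are hypothetical, and plain Haskell lists stand in for unboxed vectors and SIMD registers. It only shows the shape of the technique: one sequence carried in two stream representations, so that each consumer can pick the view that suits it.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
module Main where

-- One step of a stream: finished, skip to a new state, or yield an element.
data Step s a = Done | Skip s | Yield a s

-- Classic stream-fusion representation: a step function plus a hidden state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Hypothetical bundle: the same sequence in two representations.
-- bChunks is an assumed stand-in for a SIMD-width (or block-copy) view.
data Bundle a = Bundle
  { bElems  :: Stream a    -- element-at-a-time view
  , bChunks :: Stream [a]  -- chunked view
  }

-- Turn a list into a stream that yields its elements in order.
streamFromList :: [a] -> Stream a
streamFromList = Stream next
  where
    next []       = Done
    next (x : xs) = Yield x xs

-- Split a list into chunks of at most n elements (n is assumed positive).
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | null xs   = []
  | otherwise = let (h, t) = splitAt n xs in h : chunksOf n t

-- Build both views of the same data.
fromList :: Int -> [a] -> Bundle a
fromList n xs = Bundle (streamFromList xs) (streamFromList (chunksOf n xs))

-- Map over a single stream without building an intermediate structure.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'

-- A producer transforms every representation in the bundle.
mapB :: (a -> b) -> Bundle a -> Bundle b
mapB f (Bundle es cs) = Bundle (mapS f es) (mapS (map f) cs)

-- Strict left fold over a stream.
foldlS :: (b -> a -> b) -> b -> Stream a -> b
foldlS f z0 (Stream next s0) = go z0 s0
  where
    go z s = case next s of
      Done       -> z
      Skip s'    -> go z s'
      Yield x s' -> let z' = f z x in z' `seq` go z' s'

-- A consumer that uses the element-wise view.
sumElems :: Num a => Bundle a -> a
sumElems = foldlS (+) 0 . bElems

-- A consumer that uses the chunked view, one chunk per step.
sumChunks :: Num a => Bundle a -> a
sumChunks = foldlS (\acc c -> acc + sum c) 0 . bChunks

main :: IO ()
main = do
  let b = mapB (* 2) (fromList 4 [1 .. 10 :: Int])
  print (sumElems b)   -- 110
  print (sumChunks b)  -- 110
```

Both consumers compute the same result; they differ only in which representation they traverse. Because Haskell is lazy, the view that a consumer does not select is never evaluated, which is what makes carrying several representations side by side cheap in a sketch like this one.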
