FastFlow: High-level and Efficient Streaming on Multi-core

Computer hardware manufacturers have moved decisively to multi-core and are currently experimenting with increasingly advanced many-core architectures. In the long term, writing efficient, portable and correct parallel programs targeting multi- and many-core architectures must become no more challenging than writing the same programs for sequential computers. To date, however, most applications running on multi-core machines do not fully exploit the potential of these architectures. This situation is due in part to the scarcity of good high-level programming tools suitable for multi/many-core architectures, and in part to the fact that multi-core programming is still viewed as an exotic branch of high-performance computing (HPC) rather than as the de facto standard programming practice for the masses. Some efforts have been made to provide programmers with tools suitable for mapping data parallel computations onto both multi-cores and GPUs, the most popular many-core devices currently available. Tools have also been developed to support stream parallel computations [34, 31], as stream parallelism is de facto a pattern characteristic of a large class of (potentially) parallel applications. Two major issues with these programming environments and tools relate to programmability and efficiency. Programmability is often impaired by the modest level of abstraction provided to the programmer. Efficiency more generally suffers from the peculiarities of exploiting the memory hierarchy effectively. As a consequence, two distinct but synergistic needs exist: on the one hand, increasingly efficient mechanisms supporting correct concurrent access to shared-memory data structures are needed; on the other hand, there is a need for higher-level programming environments capable of hiding the difficulties related to the correct and efficient use of shared-memory objects by raising the level of abstraction provided to application programmers.
To address these needs we introduce and discuss FastFlow, a programming framework specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries. The lowest layer of FastFlow provides very efficient lock-free (and memory-fence-free) base synchronization mechanisms. The middle layer provides distinctive communication mechanisms supporting both single-producer/multiple-consumer and multiple-producer/single-consumer communications. These
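
To make the idea of a lock-free base mechanism concrete, the following is a minimal sketch of a bounded single-producer/single-consumer ring buffer in standard C++. It is not FastFlow's queue or API: the class name `SpscRing` and its `push`/`pop` methods are hypothetical, and the sketch relies on C++11 acquire/release atomics rather than on the fence-free technique the text refers to. It only illustrates why one producer and one consumer can exchange items without locks: each index is written by exactly one thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: a hypothetical Lamport-style SPSC ring buffer,
// not the queue implemented by FastFlow.
template <typename T>
class SpscRing {
public:
    // One slot is kept empty to distinguish "full" from "empty".
    explicit SpscRing(std::size_t capacity)
        : buf_(capacity + 1), head_(0), tail_(0) {
        assert(capacity > 0);
    }

    // Must be called by the single producer thread only.
    bool push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % buf_.size();
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // queue is full
        }
        buf_[tail] = item;
        // Publish the item: the consumer sees it once it observes the new tail.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Must be called by the single consumer thread only.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // queue is empty
        }
        out = buf_[head];
        // Release the slot so the producer may reuse it.
        head_.store((head + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;
    std::atomic<std::size_t> head_;  // written only by the consumer
    std::atomic<std::size_t> tail_;  // written only by the producer
};
```

Both operations return immediately instead of blocking, so the caller decides whether to retry or back off. Single-producer/multiple-consumer and multiple-producer/single-consumer channels, as mentioned above, could in principle be composed from several such queues plus an arbiter thread, though that composition is an assumption here and not something this abstract spells out.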

[1]  Salvatore Ruggieri,et al.  YaDT: yet another decision tree builder , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[2]  Mario Leyton,et al.  Skandium: Multi-core Programming with Algorithmic Skeletons , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.

[3]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[4]  Massimo Torquati,et al.  Single-Producer/Single-Consumer Queues on Shared Cache Multi-Core Systems , 2010, ArXiv.

[5]  Timothy G. Mattson,et al.  Patterns for parallel programming , 2004 .

[6]  Christophe Dessimoz,et al.  SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2 , 2008, BMC Research Notes.

[7]  Marco Vanneschi,et al.  The programming model of ASSIST, an environment for parallel and distributed portable applications , 2002, Parallel Comput..

[8]  Raymond H. Chan,et al.  Salt-and-pepper noise removal by median-type noise detectors and detail-preserving regularization , 2005, IEEE Transactions on Image Processing.

[9]  James Reinders,et al.  Intel® threading building blocks , 2008 .

[10]  Michael I. Gordon,et al.  Language and Compiler Design for Streaming Applications , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[11]  Gilles Kahn,et al.  The Semantics of a Simple Language for Parallel Programming , 1974, IFIP Congress.

[12]  Basilio B. Fraguela,et al.  A Generic Algorithm Template for Divide-and-Conquer in Multicore Systems , 2010, 2010 IEEE 12th International Conference on High Performance Computing and Communications (HPCC).

[13]  Herbert Kuchen,et al.  Enhancing Muesli's Data Parallel Skeletons for Multi-core Computer Architectures , 2010, 2010 IEEE 12th International Conference on High Performance Computing and Communications (HPCC).

[14]  Maurice Herlihy,et al.  Wait-free synchronization , 1991, TOPLAS.

[15]  Herbert Kuchen,et al.  The Münster Skeleton Library Muesli: A comprehensive overview , 2009 .

[16]  Peter Kilpatrick,et al.  Efficient streaming applications on multi-core with FastFlow: the biosequence alignment test-bed , 2009, PARCO.

[17]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[18]  Albert Cohen,et al.  Erbium: a deterministic, concurrent intermediate representation to map data-flow tasks to scalable, persistent streaming processes , 2010, CASES '10.

[19]  John Giacomoni,et al.  FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue , 2008, PPoPP.

[20]  Jeff Gilchrist.  Parallel Data Compression with BZIP2 , 2003 .

[21]  Marco Danelutto,et al.  Skeleton-based parallel programming: Functional and parallel semantics in a single shot , 2007, Comput. Lang. Syst. Struct..

[22]  Murray Cole,et al.  Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming , 2004, Parallel Comput..

[23]  Peter Kilpatrick,et al.  Skeletons for multi/many-core systems , 2009, PARCO.

[24]  金田 重郎,et al.  C4.5: Programs for Machine Learning (book review) , 1995 .

[25]  Hideya Iwasaki,et al.  Parallel Skeletons for Variable-Length Lists in SkeTo Skeleton Library , 2009, Euro-Par.

[26]  James Demmel,et al.  the Parallel Computing Landscape , 2022 .

[27]  Massimo Torquati,et al.  Porting Decision Tree Algorithms to Multicore using FastFlow , 2010, ECML/PKDD.

[28]  Hermann Lederer,et al.  Parallel Computing: From Multicores and GPU's to Petascale , 2010 .

[29]  Pietro Liò,et al.  StochKit-FF: Efficient Systems Biology on Multicore Architectures , 2010, Euro-Par Workshops.

[30]  Murray Cole,et al.  Algorithmic Skeletons: Structured Management of Parallel Computation , 1989 .

[31]  Robert Stephens,et al.  A survey of stream processing , 1997, Acta Informatica.

[32]  Peter Kilpatrick,et al.  Accelerating Code on Multi-cores with FastFlow , 2011, Euro-Par.

[33]  Sven-Bodo Scholz,et al.  Semantics and Type Theory of S-Net. , 2006 .

[34]  Edward A. Lee,et al.  Dataflow process networks , 1995, Proc. IEEE.

[35]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[36]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[37]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .