Direct approaches to exploit many-core architecture in bioinformatics

Current trends in computer programming seek to port and optimize existing algorithms for many-core architectures with tens of Central Processing Units (CPUs). Yet the lack of standardized general-purpose parallel programming and porting methodologies is the main bottleneck for these developments. We have focused on bioinformatics applied to genomics in general, and on so-called "Next-Generation" Sequencing (NGS) in particular, to study the viability and cost of porting and optimizing well-known algorithms to a many-core architecture. The target platform is the Tile64, a microprocessor containing 64 CPUs, each capable of running an independent Linux operating system. Three approaches to implementing existing algorithms on it have been explored: (i) implementation of the Needleman-Wunsch/Smith-Waterman pairwise aligner from scratch; (ii) direct translation of the Message Passing Interface (MPI) C++ ABySS assembly algorithm, with changes to the communication layer; and (iii) migration of the ClustalW tool, parallelizing only its most time-consuming stage. The performance-gain/development-cost tradeoffs indicate that the Tile64 microprocessor has the potential to increase the performance of bioinformatics in an unprecedented way for a standalone Personal Computer (PC). Yet effective exploitation of these parallel implementations requires a detailed understanding of the peculiar many-core characteristics when migrating previous non-parallel source codes.

Highlights
- Computing power of the Tile64 many-core microprocessor can be exploited for NGS bioinformatics tasks.
- The Tile64 many-core CPU architecture works as a cluster of pico-computers, as with the MC64-NW/SW algorithm.
- MC64-ClustalW shows an important performance improvement with a minor development effort.
- MC64-ABySS reveals that an efficient MPI-like API for Tile64 is essential to successfully port most existing parallel code.
- Widespread adoption of many-core CPU technologies could lead to a new paradigm in programming methodologies in the coming years.
