An Improved Smith-Waterman Algorithm Based on Spark Parallelization

This paper presents the design and implementation of a Spark-based parallelization of the Smith-Waterman (SW) algorithm, named Spark-OSW. Spark-OSW was then evaluated through accuracy, performance, and acceleration tests. The results show that the proposed algorithm achieved 100% accuracy, ran much faster than the original SW algorithm, and performed well in a cluster environment. These findings offer new insight into database search for gene sequences.
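For orientation, the baseline that Spark-OSW builds on can be sketched as follows. This is a minimal textbook Smith-Waterman scorer with a linear gap penalty, not the Spark-OSW algorithm itself. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the helper names `sw_score` and `search` are assumptions chosen for illustration and do not come from the paper.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    Textbook Smith-Waterman with a linear gap penalty; the scoring
    parameters are illustrative assumptions, not the paper's values.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Score query against every sequence in a dict {name: sequence}.

    Hypothetical helper: returns (name, score) pairs, best score first.
    """
    hits = [(name, sw_score(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    db = {"s1": "TTACGTTT", "s2": "GGGG", "s3": "ACGA"}
    for name, score in search("ACGT", db):
        print(name, score)
```

Each query-versus-sequence score in `search` is computed independently of the others, so a database search of this shape is a natural candidate for distribution across a cluster. How Spark-OSW actually partitions and schedules the work is not described in the abstract.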
