Distributed gene clinical decision support system based on cloud computing

A clinical decision support system can compensate for the limits of an individual doctor's knowledge, reduce misdiagnosis, and help improve health outcomes. Traditional genetic data storage and analysis technologies run in a stand-alone environment and have limited scalability, so they struggle to meet the computational demands of rapidly growing genetic data. In this paper, we propose a distributed gene clinical decision support system, named GCDSS, and implement a prototype based on cloud computing. To speed up data processing in GCDSS, we present CloudBWA, a novel distributed read mapping algorithm that uses a batch processing strategy to map reads on Apache Spark. Evaluations show that GCDSS and its component CloudBWA achieve high performance and good scalability. Among distributed algorithms, CloudBWA achieves a speedup of up to 2.63 times over SparkBWA.
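
The abstract does not specify how CloudBWA forms its batches, so the following is only an illustrative sketch of the general idea of batch processing for read mapping: reads within one partition are grouped into fixed-size batches, and the aligner is invoked once per batch rather than once per read. The names `batched`, `map_partition`, `toy_align_batch`, and the parameter `batch_size` are hypothetical and do not come from the paper. The sketch uses plain Python; on Spark, a function such as `map_partition` could plausibly be passed to `RDD.mapPartitions`.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def batched(reads: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists holding at most batch_size reads each."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    it = iter(reads)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def map_partition(
    reads: Iterable[str],
    align_batch: Callable[[List[str]], List[Tuple[str, int]]],
    batch_size: int = 4,
) -> Iterator[Tuple[str, int]]:
    """Apply align_batch once per batch and flatten the results.

    Calling the aligner per batch amortizes any fixed per-call cost
    over batch_size reads (an assumption about why batching helps).
    """
    for batch in batched(reads, batch_size):
        yield from align_batch(batch)


def toy_align_batch(batch: List[str]) -> List[Tuple[str, int]]:
    """Hypothetical stand-in for a real aligner: pairs each read with its length."""
    return [(read, len(read)) for read in batch]
```

Larger batches mean fewer aligner invocations per partition at the cost of more memory held at once, so `batch_size` would need tuning in any real deployment.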
