Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends

The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. This so-called "big data" challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital that these big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured and unstructured data.

The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework, and Hadoop is its open-source implementation, which runs on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphics processing unit (GPU) computing), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage, which yields reliable data processing by replicating the computing tasks and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation.

In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The use of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and to highlight what might be needed to improve the outcomes of clinical big data analytics tools.
The paper concludes by summarizing the potential uses of the MapReduce programming framework and the Hadoop platform for processing huge volumes of clinical data in medical health informatics and related fields.
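
As an illustration of the Map and Reduce tasks described above, the following is a minimal single-process sketch in Python. It is not Hadoop code and does not reproduce any system reviewed in the paper. The record layout, the function names (`map_record`, `shuffle`, `reduce_counts`, `run_job`) and the sample data are all hypothetical, chosen only to show how a count per diagnosis could be expressed as a map step, a grouping step, and a reduce step.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical toy records; a real job would read these from HDFS.
RECORDS = [
    {"patient_id": "p1", "diagnosis": "sepsis"},
    {"patient_id": "p2", "diagnosis": "epilepsy"},
    {"patient_id": "p3", "diagnosis": "sepsis"},
    {"patient_id": "p4", "diagnosis": "asthma"},
]


def map_record(record):
    """Map step: emit one (key, value) pair per input record."""
    yield record["diagnosis"], 1


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: fold all values for one key into a single result."""
    return key, reduce(lambda total, v: total + v, values, 0)


def run_job(records):
    """Run map, shuffle, and reduce sequentially in one process."""
    mapped = (pair for record in records for pair in map_record(record))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(RECORDS))
```

In a cluster setting, the map calls would run in parallel on the nodes holding each data chunk and the grouping would happen over the network. The sketch keeps everything sequential so that the data flow between the two tasks is easy to follow.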

[1]  D. E. Bell,et al.  Decision making: Descriptive, normative, and prescriptive interactions. , 1990 .

[2]  G. Zanetti,et al.  Parallelizing bioinformatics applications with MapReduce , 2008 .

[3]  Xiaojing Jia Google Cloud Computing Platform Technology Architecture and the Impact of Its Cost , 2010, 2010 Second World Congress on Software Engineering.

[4]  Ahmed E. Youssef AFRAMEWORK FOR SECURE H EALTHCARE SYSTEMS BASED ON BIGDATA ANALYTICS IN M OBILE CLOUD COMPUTING ENVIRONMENTS , 2014 .

[5]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[6]  John D. Owens,et al.  GPU Computing , 2008, Proceedings of the IEEE.

[7]  Henning Hermjakob,et al.  Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework , 2012, BMC Bioinformatics.

[8]  Simon White,et al.  Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline , 2014, BMC Bioinformatics.

[9]  William Gropp,et al.  Skjellum using mpi: portable parallel programming with the message-passing interface , 1994 .

[10]  Brendan MacLean,et al.  General framework for developing and evaluating database scoring algorithms using the TANDEM search engine , 2006, Bioinform..

[11]  D. Bates,et al.  Clinical Decision Support Systems , 1999, Health Informatics.

[12]  J. Qi,et al.  Whole genome molecular phylogeny of large dsDNA viruses using composition vector method , 2007, BMC Evolutionary Biology.

[13]  Yao Sun,et al.  HBase, MapReduce, and Integrated Data Visualization for Processing Clinical Signal Data , 2011, AAAI Spring Symposium: Computational Physiology.

[14]  Jing Wang,et al.  Evaluation and integration of existing methods for computational prediction of allergens , 2013, BMC Bioinformatics.

[15]  Michela Taufer,et al.  Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce , 2013, BMC Structural Biology.

[16]  Wu-chun Feng,et al.  The design, implementation, and evaluation of mpiBLAST , 2003 .

[17]  Michael Garland,et al.  Designing efficient sorting algorithms for manycore GPUs , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[18]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[19]  Ben Niu,et al.  Biomimicry of quorum sensing using bacterial lifecycle model , 2013, BMC Bioinformatics.

[20]  Edmund Kohlwey,et al.  Leveraging the Cloud for Big Data Biometrics: Meeting the Performance Requirements of the Next Generation Biometric Systems , 2011, 2011 IEEE World Congress on Services.

[21]  Judy Qiu,et al.  Cloud Technologies for Bioinformatics Applications , 2011, IEEE Trans. Parallel Distributed Syst..

[22]  Wolfgang Maass,et al.  Fractal MapReduce decomposition of sequence alignment , 2012, Algorithms for Molecular Biology.

[23]  Randal E. Bryant,et al.  Data-Intensive Supercomputing: The case for DISC , 2007 .

[24]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[25]  Lynette Hirschman,et al.  Nephele: genotyping via complete composition vectors and MapReduce , 2011, Source Code for Biology and Medicine.

[26]  S. Herculano‐Houzel The Human Brain in Numbers: A Linearly Scaled-up Primate Brain , 2009, Front. Hum. Neurosci..

[27]  Chun-Yuan Lin,et al.  A P2P Framework for Developing Bioinformatics Applications in Dynamic Cloud Environments , 2013, International journal of genomics.

[28]  E. Shortliffe Clinical decision-support systems , 1990 .

[29]  N. S. Raghava,et al.  Iris recognition on Hadoop: A biometrics system implementation on cloud computing , 2011, 2011 IEEE International Conference on Cloud Computing and Intelligence Systems.

[30]  Lawrence D. Fu,et al.  Identifying Unproven Cancer Treatments on the Health Web: Addressing Accuracy, Generalizability and Scalability , 2013, MedInfo.

[31]  L. Feldkamp,et al.  Practical cone-beam algorithm , 1984 .

[32]  Naga K. Govindaraju,et al.  Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[33]  D. E. Bell,et al.  Decision Making: DESCRIPTIVE, NORMATIVE, AND PRESCRIPTIVE INTERACTIONS IN DECISION MAKING , 1988 .

[34]  R. Ilmoniemi,et al.  Magnetoencephalography-theory, instrumentation, and applications to noninvasive studies of the working human brain , 1993 .

[35]  Simon L. Peyton Jones,et al.  The Implementation of Functional Programming Languages , 1987 .

[36]  G. Kumar,et al.  The Association of Lacking Insurance With Outcomes of Severe Sepsis: Retrospective Analysis of an Administrative Database* , 2014, Critical care medicine.

[37]  H. Braak,et al.  Neuropathological stageing of Alzheimer-related changes , 2004, Acta Neuropathologica.

[38]  Sandeep Tata,et al.  BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters , 2013, Bioinform..

[39]  José A. B. Fortes,et al.  CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications , 2008, 2008 IEEE Fourth International Conference on eScience.

[40]  Thomas L. Madden,et al.  BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. , 1999, FEMS microbiology letters.

[41]  Marcin Mazurek,et al.  Applying NoSQL Databases for Operationalizing Clinical Data Mining Models , 2014, BDAS.

[42]  Jan-Ming Ho,et al.  De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs , 2012, 2012 IEEE Fifth International Conference on Cloud Computing.

[43]  G. Sudha Sadasivam,et al.  A novel approach to multiple sequence alignment using hadoop data grids , 2010, MDAC '10.

[44]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[45]  Limsoon Wong,et al.  Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes , 2013, BMC Bioinformatics.

[46]  Christina Freytag,et al.  Using Mpi Portable Parallel Programming With The Message Passing Interface , 2016 .

[47]  Lena Mamykina,et al.  The future state of clinical data capture and documentation: a report from AMIA's 2011 Policy Meeting , 2013, J. Am. Medical Informatics Assoc..

[48]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[49]  Chia-Feng Juang,et al.  A hybrid of genetic algorithm and particle swarm optimization for recurrent network design , 2004, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[50]  George Coulouris,et al.  Distributed systems - concepts and design , 1988 .

[51]  J. Vincent,et al.  Annual Update in Intensive Care and Emergency Medicine 2023 , 2014, Annual Update in Intensive Care and Emergency Medicine.

[52]  Ian Foster,et al.  The Grid 2 - Blueprint for a New Computing Infrastructure, Second Edition , 1998, The Grid 2, 2nd Edition.

[53]  Shanrong Zhao,et al.  Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing , 2013, BMC Genomics.

[54]  S. Shuman,et al.  Structure, mechanism, and evolution of the mRNA capping apparatus. , 2001, Progress in nucleic acid research and molecular biology.

[55]  Sebti Foufou,et al.  Cloud-Ready Biometric System for Mobile Security Access , 2012, NDT.

[56]  D. Purves Body and Brain: A Trophic Theory of Neural Connections , 1988 .

[57]  C. Friedman,et al.  A drug-adverse event extraction algorithm to support pharmacovigilance knowledge mining from PubMed citations. , 2011, AMIA ... Annual Symposium proceedings. AMIA Symposium.

[58]  Baomin Xu,et al.  An efficient algorithm for DNA fragment assembly in MapReduce. , 2012, Biochemical and biophysical research communications.

[59]  M. Jonas,et al.  Patient Identification, A Review of the Use of Biometrics in the ICU , 2014 .

[60]  Perry L. Miller,et al.  The Human Brain Project: neuroinformatics tools for integrating, searching and modeling multidisciplinary neuroscience data , 1998, Trends in Neurosciences.

[61]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[62]  Simon Peyton Jones,et al.  The Implementation of Functional Programming Languages (Prentice-hall International Series in Computer Science) , 1987 .

[63]  Viju Raghupathi,et al.  Big data analytics in healthcare: promise and potential , 2014, Health Information Science and Systems.

[64]  Kai Wang,et al.  BioPig: a Hadoop-based analytic toolkit for large-scale sequence data , 2013, Bioinform..

[65]  Miguel Branco Distributed data management for large scale applications , 2009 .

[66]  Yaw-Ling Lin,et al.  Implementation of a Parallel Protein Structure Alignment Service on Cloud , 2013, International journal of genomics.

[67]  Henning Müller,et al.  Using MapReduce for Large-Scale Medical Image Analysis , 2012, 2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology.

[68]  M. Porter,et al.  The Big Idea : How to Solve the Cost Crisis in Health Care , 2012 .

[69]  Lei Xing,et al.  Ultrafast and scalable cone-beam CT reconstruction using MapReduce in a cloud computing environment. , 2011, Medical physics.

[70]  Ronald C. Taylor An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics , 2010, BMC Bioinformatics.

[71]  P. K. Mudholkar,et al.  Cloud computing and its applications , 2011, ICWET.

[72]  Rajiv Ranjan,et al.  Parallel Processing of Massive EEG Data with MapReduce , 2012, 2012 IEEE 18th International Conference on Parallel and Distributed Systems.

[73]  Dariusz Mrozek,et al.  Beyond Databases, Architectures, and Structures , 2014, Communications in Computer and Information Science.

[74]  Dwight R. Bean,et al.  Recursive Euler and Hamilton paths , 1976 .

[75]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[76]  Dimitris Margaritis,et al.  Speculative Markov blanket discovery for optimal feature selection , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[77]  Günther Specht,et al.  Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds , 2012, BMC Bioinformatics.

[78]  Kui Zhang,et al.  Dynamic programming algorithms for haplotype block partitioning: applications to human chromosome 21 haplotype data , 2003, RECOMB '03.

[79]  Darcy A. Davis,et al.  Bringing Big Data to Personalized Healthcare: A Patient-Centered Framework , 2013, Journal of General Internal Medicine.

[80]  Terence T. Ow,et al.  Examining the impact of information technology and patient flow on healthcare performance: A Theory of Swift and Even Flow (TSEF) perspective , 2013 .

[81]  Winfried Schlee,et al.  Top-Down Modulation of the Auditory Steady-State Response in a Task-Switch Paradigm , 2008, Front. Hum. Neurosci..

[82]  Ramón Díaz-Uriarte,et al.  Gene selection and classification of microarray data using random forest , 2006, BMC Bioinformatics.

[83]  Tim Kindberg,et al.  Distributed Systems: Concepts and Design (4th Edition) (International Computer Science) , 2005 .

[84]  P. Bramanti,et al.  The emerging role for chemokines in epilepsy , 2010, Journal of Neuroimmunology.

[85]  Yu-Ting Hsiao,et al.  Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment , 2014, BMC Systems Biology.

[86]  Che-Lun Hung,et al.  Novel and efficient tag SNPs selection algorithms. , 2014, Bio-medical materials and engineering.

[87]  Norden E. Huang,et al.  Ensemble Empirical Mode Decomposition: a Noise-Assisted Data Analysis Method , 2009, Adv. Data Sci. Adapt. Anal..

[88]  Kazuhiko Ohe,et al.  A user-friendly tool to transform large scale administrative data into wide table format using a mapreduce program with a pig latin based script , 2012, BMC Medical Informatics and Decision Making.

[89]  Michael C. Schatz,et al.  CloudBurst: highly sensitive read mapping with MapReduce , 2009, Bioinform..

[90]  Ben Langmead,et al.  Genotyping in the Cloud with Crossbow , 2012, Current protocols in bioinformatics.

[91]  Geoffrey C. Fox,et al.  IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID 1 Cloud Technologies for Bioinformatics Applications , 2022 .

[92]  Mark Barnes,et al.  Preparing for responsible sharing of clinical trial data. , 2013, The New England journal of medicine.

[93]  Jingfa Xiao,et al.  Bioinformatics clouds for big data manipulation , 2012, Biology Direct.

[94]  Din J. Wasem,et al.  Mining of Massive Datasets , 2014 .