Operational Efficiencies and Simulated Performance of Big Data Analytics Platform over Billions of Patient Records of a Hospital System

ARTICLE INFO
Article history: Received: 30 November, 2016; Accepted: 12 January, 2017; Online: 28 January, 2017

ABSTRACT
Big Data Analytics (BDA) is important for using data from hospital systems to reduce healthcare costs. BDA enables interactive, dynamic queries over large volumes of patient data. The objective of this study was to establish a high-performance interactive BDA platform for a hospital system. A Hadoop/MapReduce framework was established at the University of Victoria (UVic) with Compute Canada/Westgrid to form a Healthcare BDA (HBDA) platform with HBase (a NoSQL database), using hospital-specific metadata and file ingestion. Patient data profiles and clinical workflow were derived from the Vancouver Island Health Authority (VIHA), Victoria, BC, Canada. The proof-of-concept implementation tested patient data representative of the entire Provincial hospital system. We cross-referenced all data profiles and metadata with real patient data used in clinical reporting. Query performance was tested with Apache tools in the Hadoop ecosystem. At the optimized iteration, Hadoop Distributed File System (HDFS) ingestion required three seconds, but HBase required four to twelve hours to complete the Reducer phase of MapReduce. HBase bulk loads took a week for one billion records (10 TB) and over two months for three billion records (30 TB). Simple and complex queries returned results in about two seconds for one billion and three billion records. Apache Drill outperformed Apache Spark; however, Drill was restricted to running simpler queries and had poor usability for healthcare. Jupyter on Spark offered high performance and customization, running all queries simultaneously with high usability. The HBase-based BDA platform was distributed over Hadoop successfully; however, some inconsistencies in MapReduce limited operational efficiencies. The importance of Hadoop/MapReduce for representing platform performance is discussed.
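As a rough illustration of the scale behind the bulk-load figures quoted in the abstract, the sketch below derives the implied average record size and ingestion rate. It is a back-of-envelope calculation, not a result reported by the authors. It assumes 1 TB = 10^12 bytes, that "a week" means 7 days, and that "over two months" can be bounded below by 60 days, so the three-billion-record rates are upper bounds. The function name `bulkload_stats` is hypothetical.

```python
# Back-of-envelope figures implied by the abstract's bulk-load numbers.
# Assumptions (not from the source): 1 TB = 10**12 bytes, "a week" = 7 days,
# and "over two months" is taken as a lower bound of 60 days.

SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12


def bulkload_stats(records, size_tb, days):
    """Return (bytes per record, records per second, megabytes per second)."""
    seconds = days * SECONDS_PER_DAY
    total_bytes = size_tb * TB
    return (
        total_bytes / records,         # average bytes per record
        records / seconds,             # average records ingested per second
        total_bytes / seconds / 10**6, # average throughput in MB/s
    )


one_billion = bulkload_stats(10**9, 10, 7)
# 60 days is only a lower bound on the duration, so these rates are upper bounds.
three_billion = bulkload_stats(3 * 10**9, 30, 60)

for label, (bpr, rps, mbps) in [
    ("1 billion", one_billion),
    ("3 billion", three_billion),
]:
    print(f"{label}: {bpr:.0f} bytes/record, {rps:.0f} records/s, {mbps:.1f} MB/s")
```

Under these assumptions both data sets average about 10 KB per record, and the average ingestion rate for three billion records is lower than for one billion, which is consistent with the abstract's remark that MapReduce limited operational efficiency.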
