Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services

Big data analytics (BDA) is important for reducing healthcare costs. However, it faces many challenges in data aggregation, maintenance, integration, translation, analysis, and security/privacy. The objective of this study was to establish an interactive BDA platform with simulated patient data using open-source software technologies. This was achieved by constructing a platform framework on the Hadoop Distributed File System (HDFS) with HBase (a key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files showed sustained availability over hundreds of iterations; however, completing MapReduce to HBase required a week for 10 TB and a month for 30 TB (three billion indexed patient records). Inconsistencies found in MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance, with high usability for technical support but poor usability for clinical services. Building a hospital system on patient-centric data was challenging with HBase, in that not all data profiles were fully integrated with the complex patient-to-hospital relationships. Nevertheless, we recommend using HBase to secure patient data while querying entire hospital volumes in a simplified clinical event model across clinical services.
