Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

The proliferation of sensor technologies and advances in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, designing a system architecture that achieves high performance in parallelization, query processing time, and aggregation of heterogeneous data types (e.g., time series, images, and structured data), while also addressing the difficulty of reproducing scientific research, remains a major challenge. This is especially true for health sciences research, where systems must be i) easy to use, with the flexibility to manipulate data at the most granular level, ii) agnostic to the programming language kernel, iii) scalable, and iv) compliant with the HIPAA Privacy Rule. In this paper, we review the existing literature on such big data systems for scientific research in health sciences and identify gaps in the current system landscape. We propose a novel architecture for a software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes, and JupyterHub in a distributed environment. We also evaluate the system using a large clinical dataset of 69M patients.
