A Dynamic Data Warehousing Platform for Creating and Accessing Biomedical Data Lakes

Medical research use cases are population-centric, unlike clinical use cases, which are patient- or individual-centric. Hence, research use cases require access to heterogeneous medical archives and data source repositories. Traditionally, to query these data sources, users manually access and download part or all of each source. Existing solutions tend to focus on a specific data format or storage system, which prevents their use in more generic research scenarios involving heterogeneous data sources where the user may not know the schema of the data a priori. In this paper, we propose and discuss the design, implementation, and evaluation of Data Cafe, a scalable distributed architecture that aims to address the shortcomings of existing approaches. Data Cafe lets resource providers create biomedical data lakes from various data sources and lets research data users consume those data lakes efficiently and quickly without a priori knowledge of the data schema.
