Big data and precision medicine: challenges and strategies with healthcare data

Recent snapshots of European progress on big data in health care and precision medicine reveal diverse perceptions among experts and the public, giving the impression that algorithmic issues account for the largest share of the challenges that all health systems face. Yet a comparison of different countries shows that the adaptation and integration of heterogeneous data sources have a major impact on the advancement of precision medicine. Legal regulations for the implementation and operation of healthcare networking are being actively discussed in public and are gradually being implemented in several countries. Based on unified documentation, they are an ideal precondition for integrating distributed healthcare data into a big data platform with a reliable fact representation. Basic and clinical scientists now have to be motivated to share their work with these data platforms. In this work, we aim to provide an overview of the common issues in big healthcare data applications and address the challenges for the scientific, clinical and administrative partners involved. We propose a possible strategy for comprehensive data integration by iterating data harmonization, semantic enrichment and data analysis processes.
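The iteration of harmonization, semantic enrichment and analysis named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the record layout, the `curate` callback, the `EX:` concept codes and the approximate glucose conversion factors are all assumptions made for illustration. Each round harmonizes units, attaches a concept code, computes a summary, and passes records that could not be harmonized to a curation step whose output extends the unit map for the next round.

```python
CANONICAL_UNIT = "mmol/L"  # assumed target unit for this toy example


def harmonize(records, factors):
    """Convert values to the canonical unit; return (converted, unresolved)."""
    done, pending = [], []
    for rec in records:
        factor = factors.get(rec["unit"])
        if factor is None:
            pending.append(rec)  # unit unknown so far, leave for curation
        else:
            done.append({**rec, "value": rec["value"] * factor,
                         "unit": CANONICAL_UNIT})
    return done, pending


def enrich(records, concepts):
    """Attach a (hypothetical) concept code looked up by analyte name."""
    return [{**rec, "concept": concepts.get(rec["analyte"].lower())}
            for rec in records]


def analyze(records):
    """Toy analysis step: mean value per concept code."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["concept"], []).append(rec["value"])
    return {code: sum(vals) / len(vals) for code, vals in grouped.items()}


def integrate(records, factors, concepts, curate, max_rounds=3):
    """Iterate harmonize -> enrich -> analyze until nothing is unresolved."""
    factors = dict(factors)  # do not mutate the caller's mapping
    summary, pending = {}, []
    for _ in range(max_rounds):
        done, pending = harmonize(records, factors)
        summary = analyze(enrich(done, concepts))
        if not pending:
            break
        new_factors = curate(pending)  # e.g. a manual review step
        if not new_factors:
            break  # nothing learned, further rounds would not help
        factors.update(new_factors)
    return summary, pending


# Usage with invented data; conversion factors are approximate.
records = [
    {"patient": "a", "analyte": "Glucose", "value": 90.0, "unit": "mg/dL"},
    {"patient": "b", "analyte": "glucose", "value": 5.0, "unit": "mmol/L"},
    {"patient": "c", "analyte": "glucose", "value": 0.9, "unit": "g/L"},
]
summary, unresolved = integrate(
    records,
    factors={"mmol/L": 1.0, "mg/dL": 1 / 18.0},
    concepts={"glucose": "EX:glucose"},
    curate=lambda pending: {"g/L": 100 / 18.0},
)
print(summary, unresolved)
```

In this toy run the first round leaves the `g/L` record unresolved; the curation step supplies a factor, and the second round integrates all three records. A real platform would replace the dictionaries with controlled vocabularies and documented unit standards.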
