An Ontology-based Approach to Guide and Document Variable and Data Source Selection and Data Integration Process to Support Integrative Data Analysis in Cancer Outcomes Research

Background: To reduce cancer mortality and improve cancer outcomes, it is critical to understand cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs because integrated datasets that contain multi-level, multi-domain RFs are lacking. Further, the lack of consensus and proper guidance on how to systematically identify RFs increases the difficulty of selecting RFs from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, because mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, which threatens transparency and reproducibility.

Methods: Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide RF variable selection, data source selection, and the data integration process. We then developed an ontology to standardize the documentation of the RF selection and data integration processes in mIDA studies.

Results: We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection, data source selection, and data integration processes. We provide an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility. We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST.

Conclusion: Our ontology-based reporting guideline addresses some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) standardized, ontology-powered documentation of the data selection and integration processes, thereby enabling mIDA study reports to be shared among researchers.
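
The step of transforming annotation results into semantic triples can be illustrated with a minimal sketch. It assumes an annotation is recorded as a simple dictionary and emits N-Triples-style lines. The namespace, the class names (`Variable`, `DataSource`, `IntegrationStep`), the property names (`hasLevel`, `selectedFrom`, `linksVariable`), and the example values are all hypothetical placeholders chosen for illustration; they are not taken from OD-ATTEST or from the case studies.

```python
# Minimal sketch: turn one checklist annotation into subject-predicate-object triples.
# All IRIs, class names, property names, and example values are hypothetical
# placeholders, not terms or data from OD-ATTEST.
EX = "http://example.org/od-attest-sketch#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def iri(local_name):
    """Wrap a local name in the placeholder namespace as an IRI reference."""
    return f"<{EX}{local_name}>"


def annotation_to_triples(annotation):
    """Return (subject, predicate, object) triples for one annotation dict."""
    var = iri(annotation["variable"])
    src = iri(annotation["data_source"])
    step = iri(annotation["integration_step"])
    return [
        (var, RDF_TYPE, iri("Variable")),
        (src, RDF_TYPE, iri("DataSource")),
        (step, RDF_TYPE, iri("IntegrationStep")),
        (var, iri("hasLevel"), iri(annotation["level"])),
        (var, iri("selectedFrom"), src),
        (step, iri("linksVariable"), var),
    ]


def to_ntriples(triples):
    """Serialize triples as one N-Triples-style statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical annotation of a community-level risk factor.
example = {
    "variable": "county_smoking_rate",
    "level": "community",
    "data_source": "example_survey",
    "integration_step": "link_by_county_code",
}
triples = annotation_to_triples(example)
print(to_ntriples(triples))
```

Each printed line states one relationship, so the variable, its level, its data source, and the integration step that links it can be queried or shared independently of the prose report.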

[47]  M. Follen,et al.  Nativity disparities in late-stage diagnosis and cause-specific survival among Hispanic women with invasive cervical cancer: an analysis of Surveillance, Epidemiology, and End Results data , 2013, Cancer Causes & Control.