Paper information - On Defining Rules for Cancer Data Fabrication

Data is essential for machine learning projects, and data accuracy is crucial for trusting the results obtained from the associated machine learning models. Previously, we developed machine learning models for predicting the treatment outcome for breast cancer patients who have undergone chemotherapy, and a monitoring system for their treatment timeline that interactively shows the options and associated predictions. Available cancer datasets, such as the one we used earlier, are often too small to yield significant results, which makes it difficult to explore ways to further improve the predictive capability of the models. In this paper, we explore synthetic data generation as a way to enhance our datasets. From our original dataset, we extract rules for generating fabricated data that capture the different characteristics inherent in the dataset. Additional rules can be used to capture general medical knowledge. We show how to formulate rules for our cancer treatment data, and use the IBM solver to obtain a corresponding synthetic dataset. We discuss challenges for future work.
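The idea of fabricating records that satisfy a set of rules can be illustrated with a minimal sketch. It is not the method or solver used in the paper: simple rejection sampling stands in for a constraint solver, and all attribute names, domains, and rules below are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical attribute domains; not taken from the paper's dataset.
DOMAINS = {
    "age": range(25, 91),
    "tumour_stage": [1, 2, 3, 4],
    "chemo_cycles": range(1, 9),
    "completed_treatment": [True, False],
}

# Each rule is a predicate that every fabricated record must satisfy.
# Both rules are invented examples, not clinical guidance.
RULES = [
    # Stand-in for a "medical knowledge" rule:
    # stage 3 or 4 implies at least 4 chemotherapy cycles.
    lambda r: r["tumour_stage"] < 3 or r["chemo_cycles"] >= 4,
    # Stand-in for a rule extracted from the original data:
    # patients older than 80 receive at most 6 cycles.
    lambda r: r["age"] <= 80 or r["chemo_cycles"] <= 6,
]


def fabricate(n, seed=0, max_tries=100_000):
    """Draw random candidate records and keep those that satisfy all rules."""
    rng = random.Random(seed)
    records = []
    for _ in range(max_tries):
        if len(records) == n:
            break
        candidate = {name: rng.choice(list(domain))
                     for name, domain in DOMAINS.items()}
        if all(rule(candidate) for rule in RULES):
            records.append(candidate)
    return records


synthetic = fabricate(100)
```

Rejection sampling becomes inefficient as rules grow more restrictive, which is one reason a dedicated constraint solver is the more natural tool for larger rule sets.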

Juliana Küster Filipe Bowles | Eyal Bin | Michael Vinov | Agastya Silvina
