Ensemble Feature Learning to Identify Risk Factors for Predicting Secondary Cancer

Background: In recent years, the development and diagnosis of secondary cancer have become the primary concern of cancer survivors. A number of studies have developed strategies to extract knowledge from clinical data, aiming to identify important risk factors that can be used to prevent the recurrence of disease. However, these studies do not focus on secondary cancer, which still lacks strategies both for clinical treatment and for identifying the risk factors needed to prevent its occurrence.

Methods: We propose an ensemble feature learning method that identifies the risk factors for predicting secondary cancer while accounting for class imbalance and patient heterogeneity. We first divide the patients into heterogeneous groups using spectral clustering. Within each group, we apply oversampling to balance the number of samples in each class and use the balanced samples as training data for ensemble feature learning. The purpose of ensemble feature learning is to identify the risk factors and construct a diagnosis model for each group. The importance of each risk factor is measured separately for each group, based on the properties of the patients in that group. To predict secondary cancer, we assign a patient to the corresponding group and apply the diagnosis model of that group.

Results: Among the three classifiers, the decision tree obtains the best results for predicting secondary cancer. Its best results are an AUC of 0.72 when the patients are divided into 15 groups and an F1 score of 0.38 when the patients are divided into 20 groups. In terms of AUC, the decision tree achieves a 67.4% improvement over using all 20 predictor variables and a 28.6% improvement over no group division. In terms of F1 score, the decision tree achieves a 216.7% improvement over using all 20 predictor variables and an 80.9% improvement over no group division. Different groups yield different rankings of the predictor variables.

Conclusion: The accuracy of predicting secondary cancer with k-nearest neighbor, decision tree, and support vector machine classifiers increased when the selected important risk factors were used as predictors. Dividing patients into groups and predicting secondary cancer with a separate model for each group further improves prediction accuracy. The information discovered in the experiments can provide an important reference for personality and clinical symptom representations when guiding interventions, given the complexity of the multiple symptoms associated with secondary cancer in all phases of the recurrence trajectory.
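The pipeline described in the Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, random duplication as the oversampling step, the averaged feature importances of two tree ensembles as a stand-in for ensemble feature learning, and nearest-centroid assignment to place a new patient in a group. The function names and the parameters `n_groups` and `top_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def oversample(X, y, rng):
    """Duplicate random minority-class rows until all classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for c, n in zip(classes, counts):
        if n < target:
            rows = np.flatnonzero(y == c)
            extra = rng.choice(rows, size=target - n, replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.vstack(X_parts), np.concatenate(y_parts)


def rank_features(X, y, seed):
    """Assumed stand-in for ensemble feature learning: average the
    importances of two tree ensembles and sort features by that score."""
    models = [
        RandomForestClassifier(n_estimators=50, random_state=seed),
        ExtraTreesClassifier(n_estimators=50, random_state=seed),
    ]
    scores = np.mean([m.fit(X, y).feature_importances_ for m in models], axis=0)
    return np.argsort(scores)[::-1]


def fit_group_models(X, y, n_groups=3, top_k=5, seed=0):
    """Cluster patients, then balance, rank features, and fit a tree per group."""
    rng = np.random.default_rng(seed)
    labels = SpectralClustering(
        n_clusters=n_groups, affinity="nearest_neighbors",
        n_neighbors=10, random_state=seed,
    ).fit_predict(X)
    groups = []
    for g in range(n_groups):
        Xg, yg = X[labels == g], y[labels == g]
        if len(Xg) == 0:
            continue
        group = {"centroid": Xg.mean(axis=0), "model": None,
                 "features": None, "constant": int(yg[0])}
        if len(np.unique(yg)) > 1:  # a single-class group keeps a constant prediction
            Xb, yb = oversample(Xg, yg, rng)
            top = rank_features(Xb, yb, seed)[:top_k]
            group["features"] = top
            group["model"] = DecisionTreeClassifier(
                max_depth=3, random_state=seed).fit(Xb[:, top], yb)
        groups.append(group)
    return groups


def predict(groups, X_new):
    """Assign each patient to the nearest group centroid (an assumption),
    then apply that group's model."""
    centroids = np.array([g["centroid"] for g in groups])
    dists = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    out = np.empty(len(X_new), dtype=int)
    for i, g in enumerate(groups):
        mask = nearest == i
        if not mask.any():
            continue
        if g["model"] is None:
            out[mask] = g["constant"]
        else:
            out[mask] = g["model"].predict(X_new[mask][:, g["features"]])
    return out
```

Because the risk factors are ranked separately within each group, the `features` entry can differ from group to group, which mirrors the observation that different groups yield different rankings of the predictor variables. A synthetic oversampler such as SMOTE could replace the random duplication step.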
