A hybrid data mining approach for identifying the temporal effects of variables associated with breast cancer survival

Abstract. Predicting breast cancer survival is crucial for practitioners to anticipate possible outcomes and make better treatment plans for patients. In this study, a hybrid data-mining methodology was constructed to identify the variables whose importance for survival changes over time. To this end, the importance of each variable was determined for three time periods (one, five, and ten years). To conduct this analysis, the most parsimonious models were constructed using one regression method, the Least Absolute Shrinkage and Selection Operator (LASSO), and one metaheuristic optimization method, a Genetic Algorithm (GA). Because the numbers of survivors and deaths were highly imbalanced, two well-known resampling procedures, Random Under-sampling (RUS) and the Synthetic Minority Over-sampling Technique (SMOTE), were applied to improve the performance of the classification models. In the final stage, two data mining models, Artificial Neural Networks (ANNs) and Logistic Regression (LR), were trained and evaluated with 10-fold cross-validation. Sensitivity analysis (SA) was conducted for each model to determine the importance of each variable for a given model and time period. The results revealed that certain variables lose importance over time, while others gain importance. This information can help medical practitioners identify the specific subsets of variables to focus on in different periods, which in turn can lead to more effective and efficient cancer care. Moreover, the findings indicate that extremely parsimonious models can be developed with a purely data-driven approach, rather than by eliminating variables manually. Such a methodology can also be applied to other types of cancer.
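To make the resampling step concrete, the following is a minimal sketch of the two ideas named above, random under-sampling and SMOTE-style interpolation, written with the Python standard library only. It is not the authors' implementation: the function names (`random_undersample`, `smote_like`), the two-feature toy data, the class sizes, and the choice of `k=5` neighbours are all assumptions made for illustration.

```python
import random


def random_undersample(majority, minority, rng):
    """Randomly keep only as many majority rows as there are minority rows."""
    return rng.sample(majority, len(minority)), list(minority)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority rows by interpolating between a
    minority row and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to base.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 90 majority-class rows and 10 minority-class rows.
rng = random.Random(0)
survived = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(90)]
died = [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(10)]

rus_major, rus_minor = random_undersample(survived, died, rng)
new_rows = smote_like(died, n_new=80, k=5, rng=rng)
smote_minor = died + new_rows

print(len(rus_major), len(rus_minor))    # 10 10
print(len(survived), len(smote_minor))   # 90 90
```

Under-sampling discards majority-class records to balance the classes, whereas the SMOTE-style step keeps every record and adds interpolated minority rows. In practice a maintained library implementation would normally be preferred over a hand-written one, and resampling would be applied only to the training folds of the cross-validation.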
