Correlation-based Feature Selection impact on the classification of breast cancer patients' response to neoadjuvant chemotherapy

The availability of a huge number of variables is not always associated with better classification performance, as some of them can be redundant, irrelevant or a source of noise. For this reason, a Feature Selection (FS) step is often applied to high-dimensional datasets. FS based on correlation relies on the idea that “good feature subsets contain features highly correlated with the class yet uncorrelated with each other”. However, the main problem of this kind of approach is defining the threshold above which two variables are considered correlated. In this study, we evaluated the impact of different thresholds on the performance of two classifiers trained to predict the response to neoadjuvant chemotherapy (graded from 1 to 5) of 44 patients with breast cancer. First, 27 texture features were computed on the largest slices of the segmented tumor in the pretreatment dynamic contrast-enhanced MRI. Then, we applied an FS algorithm that identifies the pairs of variables whose linear correlation coefficient has an absolute value above a given threshold and removes, from each pair, the variable less correlated with the response to neoadjuvant chemotherapy. We tested correlation thresholds from 1 down to 0.8 in steps of 0.01, and we used each resulting subset to construct a Decision Tree (DT) classifier and a Linear Regression Model (LRM). Our results showed that the removal of highly correlated variables (absolute value of the correlation coefficient >0.97) reduced the DT performance by about 10%. Although the LRM did not reach acceptable results in predicting chemotherapy response (accuracy=40.9%), its intrinsic linearity made it more stable under the removal of linearly redundant variables.
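
The pairwise filter and the threshold sweep described above could be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names (`pearson`, `correlation_filter`, `threshold_sweep`) are hypothetical, and the order in which pairs are scanned, the handling of ties, and the choice to skip pairs in which one variable has already been removed are assumptions that the abstract does not specify. Feature columns are assumed to be non-constant.

```python
from math import sqrt


def pearson(u, v):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    r = cov / (norm_u * norm_v)
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, r))


def correlation_filter(columns, response, threshold):
    """Return the indices of the features kept for a given threshold.

    columns:  list of feature columns (one list of values per feature)
    response: list with the response value of each patient
    For every pair with |r| > threshold, the feature less correlated
    with the response is dropped (assumption: on a tie, the second one).
    """
    relevance = [abs(pearson(col, response)) for col in columns]
    kept = set(range(len(columns)))
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            # Assumption: pairs involving an already removed feature are skipped.
            if i not in kept or j not in kept:
                continue
            if abs(pearson(columns[i], columns[j])) > threshold:
                kept.discard(i if relevance[i] < relevance[j] else j)
    return sorted(kept)


def threshold_sweep(columns, response, start=1.0, stop=0.8, step=0.01):
    """Apply the filter for thresholds from `start` down to `stop`."""
    n_steps = round((start - stop) / step)
    subsets = {}
    for k in range(n_steps + 1):
        t = round(start - k * step, 2)
        subsets[t] = correlation_filter(columns, response, t)
    return subsets


if __name__ == "__main__":
    # Toy data (not from the study): feature 1 is nearly a copy of feature 0.
    f0 = [1, 2, 3, 4, 5, 6]
    f1 = [1.1, 1.9, 3.1, 3.9, 5.1, 5.9]
    f2 = [1, -1, 1, -1, 1, -1]
    grades = [1, 2, 3, 4, 5, 6]
    for t, subset in threshold_sweep([f0, f1, f2], grades).items():
        print(t, subset)
```

Each subset returned by the sweep would then be used to train and evaluate the two classifiers. Because the comparison is strict (`>`), a threshold of 1 removes nothing and serves as the no-selection baseline.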
