A feature-fusion framework of clinical, genomics, and histopathological data for METABRIC breast cancer subtype classification

Abstract Breast cancer is the most common cancer in women worldwide. It has been phenotypically classified into five subtypes, each with unique characteristics that reflect the heterogeneity of breast tumours. In 2012, the American Association for Cancer Research provided population-based molecular integrative clusters (IntClust) for the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset, resulting in ten subtypes. Previous work on the METABRIC dataset used only gene expression data to identify the relevant genes for each subtype, without integrating the other available data sources. The objective of this paper is to present a breast cancer subtype classification model that applies feature fusion to the METABRIC datasets, namely clinical, gene expression, Copy Number Aberrations (CNA), Copy Number Variations (CNV), and histopathological images. State-of-the-art machine learning classifiers, including Linear-SVM, Radial-SVM, Random Forests (RF), Ensemble SVM (E-SVM), and Boosting, were applied to different data profiles. The highest accuracy achieved for IntClust subtyping was 88.36%, using Linear-SVM on the data profile with features fused from the clinical, gene expression, CNA, and CNV datasets, with Jaccard and Dice scores of 0.802 and 0.8835, respectively. For Pam50 subtyping, an accuracy of 97.1% was achieved, with Jaccard scores ranging from 0.9439 to 0.9472 and a Dice score of 0.971, using the Linear-SVM and E-SVM classifiers on several data profiles that include features from histopathological images. In conclusion, our study validates that fusing features from the various METABRIC datasets improves breast cancer subtype classification performance. Moreover, histopathological images give promising results on Pam50 subtypes and are expected to improve the accuracy of IntClust subtyping when applied to a larger population.
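As an illustration of the two ideas the abstract relies on, the following minimal sketch shows feature-level fusion as a simple concatenation of per-patient feature vectors, followed by per-class Jaccard and Dice scores computed from predicted subtype labels. It is not the authors' implementation: the function names, patient IDs, feature values, and labels are all hypothetical, and concatenation is only assumed here as the simplest form of feature fusion. The abstract also does not say how the per-class scores were averaged, so the sketch reports them per class.

```python
def fuse_features(profiles, patient_ids):
    """Concatenate each patient's feature vectors across data sources.

    `profiles` is a list of dicts mapping patient ID -> list of floats
    (hypothetical stand-ins for clinical, expression, CNA, ... tables).
    Patients missing from any source are dropped.
    """
    fused = {}
    for pid in patient_ids:
        if all(pid in profile for profile in profiles):
            fused[pid] = [v for profile in profiles for v in profile[pid]]
    return fused


def jaccard_dice_per_class(y_true, y_pred):
    """Return {label: (jaccard, dice)} from true and predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = tp + fp + fn
        # Jaccard = TP / (TP + FP + FN); Dice = 2TP / (2TP + FP + FN)
        jaccard = tp / denom if denom else 0.0
        dice = 2 * tp / (2 * tp + fp + fn) if denom else 0.0
        scores[label] = (jaccard, dice)
    return scores


# Hypothetical toy data: two sources, three patients (P3 lacks clinical data).
clinical = {"P1": [61.0, 1.0], "P2": [48.0, 0.0]}
expression = {"P1": [0.3, -1.2], "P2": [1.1, 0.4], "P3": [0.0, 0.9]}
fused = fuse_features([clinical, expression], ["P1", "P2", "P3"])

# Hypothetical true and predicted subtype labels for six samples.
y_true = ["LumA", "LumA", "Basal", "Basal", "Her2", "Her2"]
y_pred = ["LumA", "Basal", "Basal", "Basal", "Her2", "LumA"]
scores = jaccard_dice_per_class(y_true, y_pred)

for label, (jaccard, dice) in scores.items():
    print(f"{label}: Jaccard={jaccard:.3f}, Dice={dice:.3f}")
```

For a single class the two scores are tied by Dice = 2J / (1 + J), and the per-class Dice score is the same quantity as the F1 score. Once scores are averaged over classes, this identity no longer has to hold exactly.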
