Finding characteristics of exceptional breast cancer subpopulations using subgroup mining and statistical test

Abstract: Breast cancer is one of the most prevalent types of cancer among women. With the increased emphasis on cancer-related research, many data-driven studies have addressed the classification of cancer diagnosis, survival, or recurrence. Unlike the existing literature, this study aims to discover interesting subgroup patterns of long-term and short-term survival in the breast cancer incidence data of the SEER (Surveillance, Epidemiology, and End Results) Program. We present a rule induction method for subgroup discovery that finds subgroup patterns by focusing on local exceptionality detection rather than on fitting a global model. The significance of the discovered subgroup patterns is examined with statistical tests. Furthermore, the characteristics of the two exceptional groups, one with high survival and one with low survival, are compared by examining the descriptive statistics of the prognostic factors in each group. The results of the case study show that the proposed combination of subgroup mining and statistical testing is a promising technique for clinical and medical data analytics.
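To make the idea of local exceptionality detection concrete, the following is a minimal sketch and not the rule induction method of this study. It assumes a brute-force search over conjunctions of attribute values, weighted relative accuracy (WRAcc) as the quality measure, and a Pearson chi-square test with one degree of freedom as the statistical test. The attribute names (`stage`, `grade`, `long_survival`), the helper functions, and the toy records are hypothetical and are not drawn from SEER.

```python
from itertools import combinations
from math import erfc, sqrt


def wracc(n_sub, pos_sub, n_all, pos_all):
    """Weighted relative accuracy: coverage * (subgroup rate - overall rate)."""
    if n_sub == 0:
        return 0.0
    return (n_sub / n_all) * (pos_sub / n_sub - pos_all / n_all)


def chi2_pvalue(n_sub, pos_sub, n_all, pos_all):
    """Pearson chi-square p-value (1 d.f.) for the 2x2 table subgroup x target."""
    a, b = pos_sub, n_sub - pos_sub          # inside subgroup: positive, negative
    c = pos_all - pos_sub                    # outside subgroup: positive
    d = (n_all - n_sub) - c                  # outside subgroup: negative
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0                           # degenerate table, no evidence
    chi2 = n_all * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 d.f.
    return erfc(sqrt(chi2 / 2))


def discover_subgroups(rows, target, max_depth=2, top_k=3):
    """Rank conjunctions of attribute=value conditions by WRAcc (hypothetical helper)."""
    n_all = len(rows)
    pos_all = sum(r[target] for r in rows)
    selectors = sorted({(k, v) for r in rows for k, v in r.items() if k != target})
    results = []
    for depth in range(1, max_depth + 1):
        for conds in combinations(selectors, depth):
            if len({k for k, _ in conds}) < depth:
                continue                     # two values of one attribute never co-occur
            covered = [r for r in rows if all(r[k] == v for k, v in conds)]
            n_sub = len(covered)
            pos_sub = sum(r[target] for r in covered)
            quality = wracc(n_sub, pos_sub, n_all, pos_all)
            p_value = chi2_pvalue(n_sub, pos_sub, n_all, pos_all)
            results.append((quality, p_value, conds, n_sub))
    results.sort(key=lambda t: t[0], reverse=True)
    return results[:top_k]


def make_rows(stage, grade, n, n_pos):
    """Build n toy records, the first n_pos of which are long-term survivors."""
    return [{"stage": stage, "grade": grade, "long_survival": int(i < n_pos)}
            for i in range(n)]


# Toy data only: 40 invented records with an overall survival rate of 0.5.
rows = (make_rows("I", "low", 10, 9) + make_rows("I", "high", 10, 6)
        + make_rows("III", "low", 10, 4) + make_rows("III", "high", 10, 1))

top = discover_subgroups(rows, "long_survival")
for quality, p_value, conds, n_sub in top:
    description = " AND ".join(f"{k}={v}" for k, v in conds)
    print(f"{description}: n={n_sub}, WRAcc={quality:.3f}, p={p_value:.4f}")
```

Sorting in ascending rather than descending order of WRAcc would surface the low-survival subgroups instead. Because many candidate subgroups are tested, a real analysis would also need a correction for multiple comparisons, which this sketch omits.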
