Breast cancer risk score: a data mining approach to improve readability

According to the World Health Organization, cancer will become the leading cause of death worldwide starting in 2010. Preventing cancer at its major sites through a quantified assessment of risk factors is therefore a major concern for reducing its impact on society. Our objective is to evaluate the performance of a modeling method that a physician can read easily. In this article, we follow a data mining process to build a reliable assessment tool for primary breast cancer risk. A k-nearest-neighbor algorithm is used to compute a risk score for different profiles from a public database. We show empirically that it is possible to achieve the same performance as logistic regression with fewer parameters and a more readable model. The process includes the intervention of a domain expert, who helps select one of the numerous model variations by finding the best compromise between physician expectations and performance. The resulting risk score is made up of four parameters: age, breast density, number of affected first-degree relatives, and history of breast biopsy. Detection performance, measured as the area under the ROC curve, is 0.637.
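To make the two technical ingredients of the abstract concrete, the following minimal Python sketch shows one way a k-nearest-neighbor risk score and the area under the ROC curve could be computed. It is an illustration under stated assumptions, not the authors' implementation: the integer encoding of the four parameters, the Manhattan distance, the value of k, and the toy profiles are all hypothetical and do not come from the public database used in the article.

```python
# Illustrative sketch only: encoding, distance, k and data are assumptions.

def distance(a, b):
    """Manhattan distance between two integer-coded profiles (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_risk_score(query, profiles, labels, k):
    """Risk score = fraction of cancer cases among the k nearest profiles."""
    ranked = sorted(range(len(profiles)),
                    key=lambda i: distance(query, profiles[i]))
    nearest = ranked[:k]
    return sum(labels[i] for i in nearest) / k


def roc_auc(scores, labels):
    """Area under the ROC curve as the probability that a random case
    receives a higher score than a random non-case (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up profiles: (age group, breast density, affected first-degree
# relatives, history of breast biopsy), each coded as a small integer.
profiles = [(1, 1, 0, 0), (1, 2, 0, 0), (2, 2, 0, 0),
            (4, 3, 1, 1), (5, 4, 1, 1), (5, 4, 2, 1)]
labels = [0, 0, 0, 1, 0, 1]  # 1 = cancer diagnosed (toy values)

high = knn_risk_score((5, 4, 1, 1), profiles, labels, k=3)
low = knn_risk_score((1, 1, 0, 0), profiles, labels, k=3)
print(high, low)
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

Because the score for a profile is simply the proportion of cases among similar women, a physician can trace any value back to the neighboring profiles that produced it, which is the kind of readability the abstract aims for.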