Breast cancer risk score: a data mining approach to improve readability

According to the World Health Organization, cancer will become the leading cause of death worldwide starting in 2010. Preventing the major cancer sites through a quantified assessment of risk factors is a key concern for reducing their impact on society. Our objective is to test the performance of a modeling method that a physician can easily read. In this article, we follow a data mining process to build a reliable assessment tool for primary breast cancer risk. A k-nearest-neighbor algorithm is used to compute a risk score for different profiles from a public database. We show empirically that it is possible to achieve the same performance as logistic regression with fewer parameters and a more readable model. The process includes the intervention of a domain expert, who helps select one of the numerous model variations by finding the best combination of physician expectations and performance. The resulting risk score is made up of four parameters: age, breast density, number of affected first-degree relatives, and prior breast biopsy. Detection performance, measured by the area under the ROC curve, is 0.637.
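To make the idea of a k-nearest-neighbor risk score concrete, here is a minimal Python sketch. It is not the authors' implementation: the integer coding of the four parameters, the synthetic data generator `make_synthetic`, the unweighted Euclidean distance, and the value of `k` are all assumptions chosen for illustration. The score of a profile is taken to be the share of cancer cases among its `k` nearest training profiles, and the area under the ROC curve is computed as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case.

```python
import random


def knn_risk_score(profile, train_X, train_y, k):
    """Share of cancer cases (label 1) among the k nearest training profiles."""
    # Squared Euclidean distance preserves the neighbor ranking.
    # Assumption: features are used unscaled and unweighted.
    neighbors = sorted(
        ((sum((a - b) ** 2 for a, b in zip(profile, x)), y)
         for x, y in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    return sum(y for _, y in neighbors[:k]) / k


def roc_auc(scores, labels):
    """AUC as P(score of a case > score of a non-case), ties counted as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n, rng):
    """Hypothetical data: [age group, breast density, affected relatives, biopsy]."""
    X, y = [], []
    for _ in range(n):
        row = [rng.randint(0, 9), rng.randint(1, 4),
               rng.randint(0, 2), rng.randint(0, 1)]
        # Made-up risk that rises with every factor (illustration only).
        p = 0.05 + 0.02 * row[0] + 0.03 * row[1] + 0.05 * row[2] + 0.05 * row[3]
        X.append(row)
        y.append(1 if rng.random() < p else 0)
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    train_X, train_y = make_synthetic(400, rng)
    test_X, test_y = make_synthetic(200, rng)
    scores = [knn_risk_score(x, train_X, train_y, k=25) for x in test_X]
    print("AUC on synthetic data:", round(roc_auc(scores, test_y), 3))
```

Because the data here are synthetic, the printed AUC says nothing about the 0.637 reported above. The sketch only shows how a neighbor-based score can be read directly as "the proportion of similar profiles that developed cancer", which is the kind of readability the abstract emphasizes.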

Laurent Brisson | Philippe Lenca | Emilien Gauthier | Stéphane Ragusa
