Using undiagnosed data to enhance computerized breast cancer analysis with a three stage data labeling method

A novel three-stage semi-supervised learning (SSL) approach is proposed to improve the performance of computerized breast cancer analysis by making use of undiagnosed data. The three stages are: (1) instance selection, which is rarely used in SSL or in computerized cancer analysis systems; (2) feature selection; and (3) a newly designed 'Divide Co-training' data labeling method. The study used 379 samples of suspicious early breast cancer areas taken from 121 mammograms. The proposed 'Divide Co-training' method generates two classifiers by splitting the original diagnosed (labeled) dataset, and assigns a label to an undiagnosed (unlabeled) sample when the two classifiers agree on it. The highest AUC (area under the curve, also called the Az value) obtained with labeled data only was 0.832, and it increased to 0.889 when undiagnosed data were included. The results indicate that the instance selection module can eliminate atypical or noisy data and improve the performance of the subsequent semi-supervised data labeling. An analysis of different data sizes shows that AUC and accuracy rise as the amount of either diagnosed or undiagnosed data increases, with the largest improvement (ΔAUC = 0.078, ΔAccuracy = 7.6%) obtained with 40 labeled samples and 300 unlabeled samples.
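The agreement-based labeling step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the base classifier, how the labeled set is split, or whether labeling is repeated, so the nearest-centroid classifier, the alternating per-class split, the single labeling pass, and all function names below are assumptions made for illustration only.

```python
def train_centroids(samples, labels):
    """Stand-in classifier (assumption): mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))


def split_labeled(labels):
    """Hypothetical split rule: alternate each class's samples between
    two halves, so both halves see every class that has >= 2 samples."""
    part_a, part_b, seen = [], [], {}
    for i, y in enumerate(labels):
        k = seen.get(y, 0)
        (part_a if k % 2 == 0 else part_b).append(i)
        seen[y] = k + 1
    return part_a, part_b


def divide_co_training(labeled_x, labeled_y, unlabeled_x):
    """One labeling pass: train one classifier on each half of the labeled
    data, then label an unlabeled sample only when both classifiers agree."""
    part_a, part_b = split_labeled(labeled_y)
    clf_a = train_centroids([labeled_x[i] for i in part_a],
                            [labeled_y[i] for i in part_a])
    clf_b = train_centroids([labeled_x[i] for i in part_b],
                            [labeled_y[i] for i in part_b])
    new_x, new_y, still_unlabeled = [], [], []
    for x in unlabeled_x:
        y_a, y_b = predict(clf_a, x), predict(clf_b, x)
        if y_a == y_b:
            new_x.append(x)
            new_y.append(y_a)
        else:
            still_unlabeled.append(x)  # disagreement: leave unlabeled
    return labeled_x + new_x, labeled_y + new_y, still_unlabeled


# Toy data (made up, not from the study): two well-separated classes.
demo_x = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
demo_y = [0, 0, 0, 1, 1, 1]
demo_unlabeled = [[0.5, 0.5], [10.5, 10.5], [5.4, 5.4]]
out_x, out_y, rest = divide_co_training(demo_x, demo_y, demo_unlabeled)
```

In this toy run the two clear-cut samples receive labels and join the training set, while the borderline sample `[5.4, 5.4]` stays unlabeled because the two classifiers disagree on it. Requiring agreement trades coverage for label quality, which is the intuition behind labeling undiagnosed data only on consensus.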
