Relabeling algorithm for retrieval of noisy instances and improving prediction quality

A relabeling algorithm for the retrieval of noisy instances with binary outcomes is presented. The algorithm iteratively retrieves, selects, and relabels data instances (i.e., transforms the decision space) to improve prediction quality, emphasizing knowledge generalization and confidence rather than classification accuracy alone. A confidence index incorporating classification accuracy, prediction error, impurities in the relabeled dataset, and cluster purities was designed. The proposed approach is illustrated with a binary outcome dataset and was successfully tested on four standard benchmark datasets from the UCI repository as well as on bladder cancer immunotherapy data. For each application, a subset of the most stable instances (7% to 51% of the sample) with high confidence (between 64% and 99.44%) was identified, along with the most noisy instances. Domain experts and the extracted knowledge validated the relabeled instances and the corresponding confidence indexes. With some modifications, the relabeling algorithm can be applied to other medical, industrial, and service domains.
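The iterative retrieve–select–relabel loop described above can be sketched in minimal form. This is one plausible reading of such a procedure, not the paper's actual method: it uses leave-one-out k-NN majority voting as a stand-in classifier (the paper's classifier and its four-component confidence index are not reproduced here), and the functions `knn_vote`, `relabel`, and the toy two-cluster data are illustrative assumptions.

```python
def knn_vote(data, labels, i, k=3):
    """Leave-one-out k-NN majority vote for instance i (binary labels 0/1)."""
    # Squared Euclidean distance from instance i to every other instance.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(data[i], data[j])), j)
        for j in range(len(data)) if j != i
    )
    votes = [labels[j] for _, j in dists[:k]]
    # Strict majority for odd k; ties (even k) fall to class 1.
    return 1 if sum(votes) * 2 >= len(votes) else 0

def relabel(data, labels, max_iter=10):
    """Iteratively flip any label its neighbors outvote, until stable."""
    labels = list(labels)
    flipped = set()                     # indices identified as noisy
    for _ in range(max_iter):
        changed = False
        for i in range(len(data)):
            pred = knn_vote(data, labels, i)
            if pred != labels[i]:       # classifier disagrees: relabel
                labels[i] = pred
                flipped.add(i)
                changed = True
        if not changed:                 # decision space has stabilized
            break
    return labels, flipped

# Toy example: two clusters, with index 3 deliberately mislabeled.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
noisy = [0, 0, 0, 1, 1, 1, 1, 1]
cleaned, flipped = relabel(data, noisy)
```

On this toy data the loop flips only index 3 and then stabilizes; the instances that are never flipped play the role of the "most stable" subset, while `flipped` collects the noisy ones.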
