Decontamination of Training Samples for Supervised Pattern Recognition Methods

The present work discusses what have been called 'imperfectly supervised situations': pattern recognition applications in which the assumption of label correctness does not hold for every element of the training sample. A methodology for contending with these practical situations and avoiding their negative impact on the performance of supervised methods is presented. This methodology can be regarded as a cleaning process that removes some suspicious instances from the training sample or corrects the class labels of others while retaining them. It was conceived for classification with the Nearest Neighbor rule, a supervised nonparametric classifier that combines conceptual simplicity with an asymptotic error rate bounded in terms of the optimal Bayes error. However, initial experiments on the learning phase of a Multilayer Perceptron (not reported in the present work) seem to indicate a broader applicability. Results with both simulated and real data sets are presented to support the methodology and to clarify the ideas behind it. Related works are briefly reviewed, and some issues deserving further research are outlined.
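
To make the idea of a cleaning process concrete, the following is a minimal sketch in Python of one way such a step could look. It is not the procedure of the paper, whose details are not given in this abstract. It assumes a leave-one-out k-nearest-neighbor vote with Euclidean distance. The function name `decontaminate` and the parameters `k` and `min_votes` are hypothetical and chosen only for illustration. A sample whose neighbors agree strongly enough takes the neighbors' majority label (which either confirms or corrects its own label); a sample whose neighbors do not reach the agreement threshold is treated as suspicious and removed.

```python
from collections import Counter
import math


def decontaminate(samples, labels, k=3, min_votes=2):
    """Illustrative cleaning pass (an assumption, not the paper's algorithm).

    For each training sample, the k nearest OTHER samples vote with their
    original labels. If the winning label gets at least `min_votes` votes,
    the sample is kept with that label (confirmed or corrected). Otherwise
    the sample is discarded as suspicious.
    """
    kept_samples, kept_labels = [], []
    for i, x in enumerate(samples):
        # Leave-one-out: rank every other sample by Euclidean distance to x.
        others = [j for j in range(len(samples)) if j != i]
        others.sort(key=lambda j: math.dist(x, samples[j]))
        votes = Counter(labels[j] for j in others[:k])
        top_label, top_count = votes.most_common(1)[0]
        if top_count < min_votes:
            continue  # neighbors disagree too much: remove the sample
        kept_samples.append(x)
        kept_labels.append(top_label)  # same label, or a corrected one
    return kept_samples, kept_labels


# Toy data: the point (1.5,) lies inside class 'a' but is labeled 'b'.
samples = [(0.0,), (1.0,), (2.0,), (1.5,), (10.0,), (11.0,), (12.0,)]
labels = ["a", "a", "a", "b", "b", "b", "b"]

clean_samples, clean_labels = decontaminate(samples, labels, k=3, min_votes=2)
```

In this toy run, all votes use the original labels, so the order in which samples are visited does not affect the result. Raising `min_votes` makes the pass stricter: more samples are removed rather than kept or relabeled.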
