The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter

We observe a training set $Q$ composed of $l$ labeled samples $\{(X_1,\theta_1),\dots,(X_l,\theta_l)\}$ and $u$ unlabeled samples $\{X'_1,\dots,X'_u\}$. The labels $\theta_i$ are independent random variables satisfying $\Pr\{\theta_i = 1\} = \eta$, $\Pr\{\theta_i = 2\} = 1-\eta$. The labeled observations $X_i$ are independently distributed with conditional density $f_{\theta_i}(\cdot)$ given $\theta_i$. Let $(X_0,\theta_0)$ be a new sample, independently distributed as the samples in the training set. We observe $X_0$ and we wish to infer the classification $\theta_0$. In this paper we first assume that the densities $f_1(\cdot)$ and $f_2(\cdot)$ are given and that the mixing parameter $\eta$ is unknown. We show that the relative value of labeled and unlabeled samples in reducing the risk of optimal classifiers is the ratio of the Fisher informations they carry about the parameter $\eta$. We then assume that two densities $g_1(\cdot)$ and $g_2(\cdot)$ are given, but we do not know whether $g_1(\cdot) = f_1(\cdot)$ and $g_2(\cdot) = f_2(\cdot)$ or whether the opposite holds, nor do we know $\eta$. Thus the learning problem consists of both estimating the optimal partition of the observation space and assigning classifications to the decision regions. Here we show that labeled samples are necessary to construct a classification rule and that they are exponentially more valuable than unlabeled samples.
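For concreteness, the two Fisher informations in the known-densities setting can be written out directly; this is a sketch that follows from the standard Fisher-information calculus for the model above, not a quotation of the paper's proofs. A labeled sample $(X_i,\theta_i)$ has joint density $\eta\,f_1(x)$ when $\theta_i = 1$ and $(1-\eta)\,f_2(x)$ when $\theta_i = 2$; since the conditional density of $X_i$ given $\theta_i$ does not depend on $\eta$, all of the information about $\eta$ comes from the Bernoulli label:

$$I_\ell(\eta) \;=\; \frac{1}{\eta} + \frac{1}{1-\eta} \;=\; \frac{1}{\eta(1-\eta)}.$$

An unlabeled sample $X'_j$ is drawn from the mixture $f_\eta(x) = \eta f_1(x) + (1-\eta) f_2(x)$, whose score with respect to $\eta$ is $(f_1(x) - f_2(x))/f_\eta(x)$, giving

$$I_u(\eta) \;=\; \int \frac{\bigl(f_1(x) - f_2(x)\bigr)^2}{\eta f_1(x) + (1-\eta) f_2(x)}\,dx.$$

By the data-processing inequality for Fisher information, $I_u(\eta) \le I_\ell(\eta)$, with equality when $f_1$ and $f_2$ have disjoint supports (an unlabeled observation then reveals its own label); the ratio $I_u(\eta)/I_\ell(\eta) \in [0,1]$ quantifies the relative value of an unlabeled sample referred to above.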
