Finite-sample analysis of impacts of unlabeled data and their labeling mechanisms in linear discriminant analysis

ABSTRACT It is widely believed that unlabeled data can improve prediction accuracy in classification problems. Although theoretical studies exist on when and how unlabeled data are beneficial, the actual improvement in prediction has not been systematically investigated for finite samples. We investigate the impact of unlabeled data in linear discriminant analysis by comparing the error rates of classifiers estimated with and without unlabeled data. Our focus is the labeling mechanism, which characterizes the probabilistic structure by which labeled cases occur. The results imply that even an extremely small proportion of unlabeled data has a large effect on the analysis results.
