Estimating a Logistic Discrimination Function When One of the Training Samples Is Subject to Misclassification: A Maximum Likelihood Approach

The problem of discrimination and classification is central to much of epidemiology. Here we consider estimating a logistic regression/discrimination function from training samples when one of the samples is subject to misclassification or mislabeling, e.g. diseased individuals who are incorrectly labeled as healthy controls. We show that this leads to a zero-inflated binomial model with a defective logistic regression or discrimination function, whose parameters can be estimated using standard statistical methods such as maximum likelihood. These parameters can then be used to estimate the probability of true group membership among those classified, possibly erroneously, as controls. Two examples are analyzed and discussed. A simulation study explores the properties of the maximum likelihood parameter estimates and of the estimates of the number of mislabeled observations.
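
To make the structure of the model concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single covariate `x`, a true-case probability that follows an ordinary logistic curve, and a hypothetical parameter `p` for the probability that a true case is also labeled as a case; the names `beta0`, `beta1`, and `p` are illustrative and do not come from the abstract. Under these assumptions the probability of being labeled a case is a logistic curve whose upper asymptote is `p` rather than 1 (the "defective" logistic function), and Bayes' rule gives the probability that a labeled control is in fact a case. Any adjustment for separate sampling of cases and controls is ignored here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def labeled_case_prob(x, beta0, beta1, p):
    """P(labeled case | x): a logistic curve scaled by p (assumed
    probability that a true case is correctly labeled), so its upper
    asymptote is p instead of 1."""
    return p * sigmoid(beta0 + beta1 * x)


def neg_log_lik(params, xs, labels):
    """Negative Bernoulli log-likelihood of the observed labels
    (1 = labeled case, 0 = labeled control) under the sketch model."""
    beta0, beta1, p = params
    total = 0.0
    for x, y in zip(xs, labels):
        q = labeled_case_prob(x, beta0, beta1, p)
        q = min(max(q, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= math.log(q) if y == 1 else math.log(1.0 - q)
    return total


def prob_true_case_given_control(x, beta0, beta1, p):
    """P(true case | labeled control, x) by Bayes' rule:
    (1 - p) * s / (1 - p * s), where s is the true-case probability."""
    s = sigmoid(beta0 + beta1 * x)
    return (1.0 - p) * s / (1.0 - p * s)
```

Minimizing `neg_log_lik` over `(beta0, beta1, p)` with any general-purpose optimizer, with `p` constrained to (0, 1], would give maximum likelihood estimates under this sketch. Summing `prob_true_case_given_control` over the labeled controls then gives one possible estimate of the number of mislabeled observations.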
