Estimating the class prior for positive and unlabelled data via logistic regression

In this paper, we revisit the problem of class prior probability estimation from positive and unlabelled data gathered in a single-sample scenario. The task is important because it is known that, in the positive-unlabelled setting, a classifier can be learned successfully once the class prior is available. We show that without additional assumptions the class prior probability is not identifiable, and thus the existing non-parametric estimators are necessarily biased in general; the magnitude of their bias is also investigated. The problem becomes identifiable when the probabilistic structure satisfies mild semi-parametric assumptions. Consequently, we propose a method based on a logistic fit and a concave minorization of its (non-concave) log-likelihood. Experiments conducted on artificial and benchmark datasets, as well as on the large clinical database MIMIC, indicate that the estimation errors of the proposed method are usually lower than those of its competitors and that it is robust against departures from the logistic setting.
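
To make the setup concrete, the sketch below illustrates the general idea in the single-sample scenario: the observed label s equals 1 only for positives selected with a constant label frequency c = P(s=1 | y=1), the true posterior P(y=1 | x) is modelled by logistic regression, and the class prior is recovered as the average fitted posterior. This is a simplified illustration, not the authors' implementation: a generic quasi-Newton optimizer stands in for the paper's concave-minorization (MM) step, and all function and variable names (estimate_class_prior, neg_pu_loglik, etc.) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Assumed PU data: features X and observed labels s, where s = 1 marks the
# labelled positives drawn under the single-sample (SCAR) scenario with a
# constant label frequency c = P(s=1 | y=1).  We posit the logistic model
# P(y=1 | x) = sigmoid(b0 + x @ b), so that P(s=1 | x) = c * sigmoid(b0 + x @ b),
# and maximize the observed-data log-likelihood jointly in (b0, b, c).

def neg_pu_loglik(theta, X, s):
    b0, b, logit_c = theta[0], theta[1:-1], theta[-1]
    c = expit(logit_c)                       # keep c in (0, 1)
    p_s1 = c * expit(b0 + X @ b)             # P(s=1 | x)
    eps = 1e-12                              # numerical guard against log(0)
    return -np.sum(s * np.log(p_s1 + eps) + (1 - s) * np.log(1 - p_s1 + eps))

def estimate_class_prior(X, s):
    theta0 = np.zeros(X.shape[1] + 2)        # b0 = 0, b = 0, c = 0.5 at start
    res = minimize(neg_pu_loglik, theta0, args=(X, s), method="L-BFGS-B")
    b0, b = res.x[0], res.x[1:-1]
    # Class prior pi = P(y=1), estimated as the average fitted posterior.
    return float(np.mean(expit(b0 + X @ b)))

if __name__ == "__main__":
    # Toy data with a known logistic posterior and SCAR labelling of positives.
    rng = np.random.default_rng(0)
    n, c_true = 5000, 0.5
    X = rng.normal(size=(n, 2))
    y = rng.binomial(1, expit(X @ np.array([2.0, -1.0]) - 0.4))
    s = y * rng.binomial(1, c_true, size=n)
    print("empirical prior:", y.mean())
    print("estimated prior:", estimate_class_prior(X, s))
```

The joint objective above is not concave in (b0, b, c), which is exactly why the paper replaces direct maximization with a concave minorization; the generic optimizer here may therefore stop in a local optimum that the MM scheme is designed to handle more reliably.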
