Statistical Hypothesis Testing in Positive Unlabelled Data

We propose a set of novel methodologies that enable valid statistical hypothesis testing when only positive and unlabelled (PU) examples are available. This type of problem, a special case of semi-supervised learning, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we make three key contributions: (1) a proof that assuming all unlabelled examples are negative is sufficient for independence testing, but not for power analysis; (2) a new methodology that compensates for this and enables power analysis, allowing the sample size required to observe an effect with a desired power to be determined; and (3) a new capability, supervision determination, which establishes a priori how many labelled examples the user must collect before a desired statistical effect can be observed. Beyond general hypothesis testing, we expect these tools to be useful for information-theoretic feature selection and Bayesian network structure learning.
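To make the two standard ingredients referred to above concrete, the sketch below is a minimal illustration, not the paper's own methodology: a generalised likelihood ratio (G) test of independence between a discrete feature and the PU label, with unlabelled examples treated as negatives, followed by a conventional chi-squared power analysis that solves for the sample size needed to detect a given effect size with a desired power. The function names (g_test_assume_negative, sample_size_for_power) are hypothetical; the test and the power calculation use SciPy's chi2_contingency (log-likelihood option) and the noncentral chi-squared distribution.

```python
# Minimal sketch under the assumptions stated above; not the authors' implementation.
import numpy as np
from scipy import stats

def g_test_assume_negative(x, s):
    """G-test (generalised likelihood ratio test) of independence between a
    discrete feature x and the PU label s, where s == 1 marks a labelled
    positive and s == 0 an unlabelled example, here treated as negative."""
    table = np.asarray([[np.sum((x == v) & (s == lab)) for lab in (0, 1)]
                        for v in np.unique(x)])
    # lambda_="log-likelihood" gives the G statistic; correction=False avoids
    # Yates' correction so the statistic matches the plain likelihood ratio.
    g, p, dof, _ = stats.chi2_contingency(table, correction=False,
                                          lambda_="log-likelihood")
    return g, p, dof

def sample_size_for_power(w, dof, alpha=0.05, power=0.8):
    """Smallest N such that a chi-squared test with `dof` degrees of freedom
    detects an effect of size w (Cohen's w) with at least the requested power.
    The noncentrality parameter of the alternative distribution is N * w**2."""
    crit = stats.chi2.ppf(1.0 - alpha, dof)
    n = 1
    while stats.ncx2.sf(crit, dof, n * w ** 2) < power:
        n += 1
    return n

# Example usage: sample size for a small effect (w = 0.1) in a 2x2 table.
if __name__ == "__main__":
    print(sample_size_for_power(w=0.1, dof=1))
```

In the fully supervised setting this power calculation tells the user how many examples to collect; the abstract's point is that under the "unlabelled = negative" assumption the test remains valid, but this naive power analysis no longer gives the right sample size, which is what the proposed corrections address.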
