Statistical Hypothesis Testing in Positive Unlabelled Data

We propose a set of novel methodologies that enable valid statistical hypothesis testing when only positive and unlabelled (PU) examples are available. This type of problem, a special case of semi-supervised learning, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we make three key contributions: (1) a proof that assuming all unlabelled examples are negative is sufficient for independence testing, but not for power analysis; (2) a new methodology that compensates for this and enables power analysis, allowing the sample size required to observe an effect with a desired power to be determined; and (3) a new capability, supervision determination, which establishes a priori how many labelled examples the user must collect before a desired statistical effect can be observed. Beyond general hypothesis testing, we expect these tools to be useful for information-theoretic feature selection and Bayesian network structure learning.
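To make the two standard ingredients referred to above concrete, the sketch below is a minimal illustration, not the paper's own methodology: a generalised likelihood ratio (G) test of independence between a discrete feature and the PU label, with unlabelled examples treated as negatives, followed by a conventional chi-squared power analysis that solves for the sample size needed to detect a given effect size with a desired power. The function names (g_test_assume_negative, sample_size_for_power) are hypothetical; the test and the power calculation use SciPy's chi2_contingency (log-likelihood option) and the noncentral chi-squared distribution.

```python
# Minimal sketch under the assumptions stated above; not the authors' implementation.
import numpy as np
from scipy import stats

def g_test_assume_negative(x, s):
    """G-test (generalised likelihood ratio test) of independence between a
    discrete feature x and the PU label s, where s == 1 marks a labelled
    positive and s == 0 an unlabelled example, here treated as negative."""
    table = np.asarray([[np.sum((x == v) & (s == lab)) for lab in (0, 1)]
                        for v in np.unique(x)])
    # lambda_="log-likelihood" gives the G statistic; correction=False avoids
    # Yates' correction so the statistic matches the plain likelihood ratio.
    g, p, dof, _ = stats.chi2_contingency(table, correction=False,
                                          lambda_="log-likelihood")
    return g, p, dof

def sample_size_for_power(w, dof, alpha=0.05, power=0.8):
    """Smallest N such that a chi-squared test with `dof` degrees of freedom
    detects an effect of size w (Cohen's w) with at least the requested power.
    The noncentrality parameter of the alternative distribution is N * w**2."""
    crit = stats.chi2.ppf(1.0 - alpha, dof)
    n = 1
    while stats.ncx2.sf(crit, dof, n * w ** 2) < power:
        n += 1
    return n

# Example usage: sample size for a small effect (w = 0.1) in a 2x2 table.
if __name__ == "__main__":
    print(sample_size_for_power(w=0.1, dof=1))
```

In the fully supervised setting this power calculation tells the user how many examples to collect; the abstract's point is that under the "unlabelled = negative" assumption the test remains valid, but this naive power analysis no longer gives the right sample size, which is what the proposed corrections address.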
