Feature subset selection from positive and unlabelled examples

The feature subset selection problem is of growing importance in many machine learning applications where the number of variables is very high. A great number of algorithms address this problem on supervised databases but, when examples from one or more classes are not available, supervised feature subset selection algorithms cannot be applied directly. One of these algorithms is correlation-based feature selection (CFS). In this work we propose an adaptation of this algorithm that can be applied when only positive and unlabelled examples are available. To the best of our knowledge, this is the first time the feature subset selection problem has been studied in the positive-unlabelled learning context. We have tested this adaptation on synthetic datasets obtained by sampling Bayesian network models in which we know which variables are (in)dependent of the class. We have also tested it on real-life databases where the absence of negative examples has been simulated. The results show that, given enough positive examples, it is possible to obtain good solutions to the feature subset selection problem when only positive and unlabelled instances are available.
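
For reference, below is a minimal sketch of the standard CFS merit score that the paper adapts, assuming discrete features and symmetrical uncertainty as the correlation measure (the usual choice in CFS). The positive-unlabelled adaptation proposed by the authors is not reproduced here; it would have to replace the feature-class correlation estimate, which cannot be computed directly when negative examples are missing.

```python
# Hypothetical sketch of the standard CFS merit score, not the paper's
# positive-unlabelled adaptation. Assumes discrete features and symmetrical
# uncertainty (SU) as the correlation measure.
import numpy as np
from itertools import combinations


def entropy(x):
    """Empirical entropy of a discrete variable (natural-log units)."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))


def symmetrical_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), normalised to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])  # pair values for joint entropy
    mi = hx + hy - entropy(joint)
    denom = hx + hy
    return 2.0 * mi / denom if denom > 0 else 0.0


def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # average feature-class correlation (this is the term a PU adaptation must estimate)
    r_cf = np.mean([symmetrical_uncertainty(X[:, j], y) for j in subset])
    if k == 1:
        return r_cf
    # average feature-feature correlation (computable from unlabelled data alone)
    r_ff = np.mean([symmetrical_uncertainty(X[:, i], X[:, j])
                    for i, j in combinations(subset, 2)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```

Note that only the numerator term depends on the class labels; the feature-feature correlations in the denominator can be estimated from the unlabelled sample, which is what makes a positive-unlabelled variant of CFS plausible.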
