One Class Splitting Criteria for Random Forests

Random Forests (RFs) are powerful machine learning tools for classification and regression. They remain supervised algorithms, however, and no extension of RFs to the one-class setting has been proposed, apart from techniques based on second-class sampling. This work fills that gap: we propose a natural methodology for extending standard splitting criteria to the one-class setting, structurally generalizing RFs to one-class classification. An extensive benchmark against seven state-of-the-art anomaly detection algorithms empirically demonstrates the relevance of the approach.
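To make the abstract's central idea concrete, below is a minimal sketch, not the authors' implementation, of one way a two-class Gini splitting criterion can be adapted to the one-class setting: the missing second class is modeled as hypothetical outliers spread uniformly over the node, so each candidate child is charged an outlier count proportional to its volume rather than to actual sampled points. All names here (`gini`, `one_class_gini_gain`, `n_outliers`) and the specific uniform weighting are illustrative assumptions.

```python
import numpy as np

def gini(n_pos: float, n_neg: float) -> float:
    """Two-class Gini impurity computed from class counts."""
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    p = n_pos / n
    return 2.0 * p * (1.0 - p)

def one_class_gini_gain(x: np.ndarray, threshold: float,
                        lo: float, hi: float,
                        n_outliers: float) -> float:
    """Gini gain of splitting 1-D one-class data at `threshold`.

    The absent second class is represented by `n_outliers`
    hypothetical points uniform on [lo, hi]; each child node is
    charged an outlier count proportional to its length (its
    volume in one dimension), so nothing is ever sampled.
    """
    n = len(x)
    n_left = np.sum(x <= threshold)
    n_right = n - n_left
    # Expected uniform-outlier mass falling in each child node.
    vol = hi - lo
    out_left = n_outliers * (threshold - lo) / vol
    out_right = n_outliers * (hi - threshold) / vol
    parent = gini(n, n_outliers)
    w_left = (n_left + out_left) / (n + n_outliers)
    w_right = (n_right + out_right) / (n + n_outliers)
    children = w_left * gini(n_left, out_left) + w_right * gini(n_right, out_right)
    return parent - children

# Usage: pick the best split of a one-dimensional normal sample.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
lo, hi = x.min(), x.max()
candidates = np.linspace(lo, hi, 101)[1:-1]
gains = [one_class_gini_gain(x, t, lo, hi, n_outliers=len(x)) for t in candidates]
print("best split at", candidates[int(np.argmax(gains))])
```

Because the outlier counts follow in closed form from node volumes, no second-class points need to be generated, which is the structural difference from the sampling-based extensions the abstract contrasts with.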
