Improved Random Forest for Classification

We propose an improved random forest classifier that performs classification with a minimum number of trees. The proposed method iteratively removes unimportant features and, based on the numbers of important and unimportant features, derives a novel theoretical upper bound on the number of trees that must be added to the forest to ensure an improvement in classification accuracy. The algorithm converges to a reduced set of important features, and we prove that neither adding further trees nor removing further features improves classification performance. The efficacy of the proposed approach is demonstrated through experiments on benchmark data sets. We further use the proposed classifier to detect mitotic nuclei in histopathology data sets of breast tissue, and apply it to an industrial data set of dual-phase steel microstructures to classify the different phases. On all of these data sets, our method yields a significant reduction in average classification error compared with a number of competing methods.
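The iterative loop described above can be illustrated with a minimal sketch, with the caveat that it is not the authors' exact algorithm: the mean-importance cutoff, the fixed 100 trees per round, and scikit-learn's impurity-based importances are all illustrative assumptions, whereas the paper derives a theoretical upper bound on the number of trees to add in each round.

    # Minimal sketch of iterative feature elimination with a random forest.
    # NOT the paper's exact method: the cutoff and tree count are assumptions.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    kept = np.arange(X.shape[1])  # indices of currently retained features

    for _ in range(10):  # cap on elimination rounds (assumed)
        forest = RandomForestClassifier(n_estimators=100, random_state=0)
        forest.fit(X[:, kept], y)
        imp = forest.feature_importances_
        # Assumed rule: features below the mean importance are "unimportant".
        important = imp >= imp.mean()
        if important.all():  # converged: nothing left to remove
            break
        kept = kept[important]

    print(f"retained {len(kept)} of {X.shape[1]} features")

The stopping rule here (no feature falls below the cutoff) stands in for the paper's convergence criterion, under which neither adding trees nor removing features can further improve accuracy.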
