Improving Classification Accuracy based on Random Forest Model through Weighted Sampling for Noisy Data with Linear Decision Boundary

Background: Random forest algorithms typically draw a simple random sample of observations when building each decision tree. Under random selection, noisy, outlying and non-informative data points have the same chance of entering tree construction as any other point, which can lead to poor ensemble classification decisions. This paper aims to optimize sample selection through probability proportional to size sampling (weighted sampling), in which noisy, outlying and non-informative data points are down-weighted to improve the classification accuracy of the model. Methods: The weight of each data point is determined from two measures: the influence of the data point on the model, found through the Leave-One-Out method using a single classification tree, and the deviance residual of the data point under a logistic regression model. The two measures are combined into the final weight. Results: The proposed Finest Random Forest (FRF) performs consistently better than the conventional Random Forest (RF) in terms of classification accuracy. Conclusion: Classification accuracy improves when random forest is combined with probability proportional to size sampling (weighted sampling) for noisy data with a linear decision boundary.
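The weighting idea described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two measures are combined, so the product rule in `sampling_weights` is an assumption, as are all function names, the fitted probabilities `p`, and the `influence` values (which stand in for a Leave-One-Out score from a single classification tree). Only the deviance residual formula for a binary logistic model and the weighted draw itself are standard.

```python
import math
import random

def deviance_residuals(y, p, eps=1e-12):
    """Deviance residuals for binary labels y (0/1) and fitted probabilities p."""
    out = []
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)          # guard against log(0)
        loglik = yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
        sign = 1.0 if yi >= pi else -1.0           # sign of (y - p)
        out.append(sign * math.sqrt(-2.0 * loglik))
    return out

def sampling_weights(residuals, influence):
    """Turn the two measures into selection probabilities.

    ASSUMPTION: the combination rule is hypothetical. Points with a large
    absolute deviance residual or a large (harmful) influence score get
    a smaller weight; weights are normalized to sum to 1.
    """
    raw = [1.0 / ((1.0 + abs(d)) * (1.0 + max(h, 0.0)))
           for d, h in zip(residuals, influence)]
    total = sum(raw)
    return [r / total for r in raw]

def pps_bootstrap(n_draws, weights, rng):
    """Draw row indices with replacement, probability proportional to weight."""
    return rng.choices(range(len(weights)), weights=weights, k=n_draws)

# Toy data: the last point has label 1 but fitted probability 0.05,
# so it behaves like a noisy observation.
y = [1, 1, 0, 0, 1]
p = [0.9, 0.8, 0.2, 0.1, 0.05]            # assumed logistic-regression output
influence = [0.0, 0.0, 0.0, 0.0, 0.2]     # hypothetical Leave-One-Out scores

d = deviance_residuals(y, p)
w = sampling_weights(d, influence)
sample = pps_bootstrap(1000, w, random.Random(0))
```

In a full pipeline, each tree of the forest would be trained on its own `pps_bootstrap` draw in place of a uniform bootstrap sample, so that down-weighted points appear in fewer trees.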
