A nifty collaborative analysis for predicting a novel tool (DRFLLS) for missing values estimation

One important trend in intelligent data analysis is the growing importance of data processing. This stage faces problems similar to those of data mining (i.e., high-dimensional data, missing value imputation, and data integration); one of the challenges in missing value estimation methods is how to select the optimal number of nearest neighbors used to estimate those values. This paper explores the feasibility of building a novel tool, called developed random forest and local least squares (DRFLLS), to estimate missing values in various datasets. By developing the random forest algorithm, seven categories of similarity measures were defined: the Pearson similarity coefficient, simple similarity, and five fuzzy similarity measures (M1, M2, M3, M4 and M5). These are sufficient to estimate the optimal number of neighbors of the missing values in this application. Local least squares (LLS) is then used to estimate the missing values. Imputation accuracy is measured in two ways, Pearson correlation (PC) and normalized root mean square error (NRMSE), so the optimal number of neighbors is the one associated with the highest PC and the lowest NRMSE. Experiments were carried out on six datasets from different disciplines. DRFLLS shows that for a dataset with a small rate of missing values, the best estimate of the number of nearest neighbors is given by DRFPC and, in the second degree, by DRFFSM1 when r = 4, whereas for a dataset with a high rate of missing values, the best estimate is given by DRFFSM5 and, in the second degree, by DRFFSM3. The missing values were then estimated by LLS, and the accuracy of the results was measured by NRMSE and Pearson correlation: for a given dataset, the DRF correlation function that yields the smallest NRMSE and the highest PC is the better function for that dataset.

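To make the pipeline concrete, the following is a minimal sketch in Python of the LLS imputation step and the two accuracy measures (NRMSE and PC). It is not the authors' DRFLLS implementation: the neighbor-selection step, which in the paper relies on the developed random forest similarity measures (DRFPC and DRFFSM1–M5), is replaced here by a plain Pearson-correlation ranking for illustration, and the function names (lls_impute, nrmse, pearson_corr) are hypothetical.

```python
# Minimal sketch (not the authors' DRFLLS code): local least squares (LLS)
# imputation of one missing entry, plus the NRMSE and PC accuracy measures.
import numpy as np

def lls_impute(X, row, col, k):
    """Estimate X[row, col] from the k rows most similar to `row`,
    using a least-squares fit over the columns observed in `row`."""
    obs = ~np.isnan(X[row])
    obs[col] = False  # predictor columns: observed in the target row, excluding `col`
    # candidate rows: observed at `col` and on every predictor column
    cand = [i for i in range(X.shape[0])
            if i != row and not np.isnan(X[i, col])
            and not np.isnan(X[i][obs]).any()]
    # rank candidates by |Pearson correlation| with the target row
    # (a stand-in for the DRF similarity measures used in the paper)
    sims = [abs(np.corrcoef(X[i][obs], X[row][obs])[0, 1]) for i in cand]
    nbrs = [cand[j] for j in np.argsort(sims)[::-1][:k]]
    A = X[np.ix_(nbrs, np.flatnonzero(obs))]  # k x p matrix of neighbor predictors
    b = X[nbrs, col]                          # k observed values in the missing column
    w = X[row][obs]                           # p observed values of the target row
    coef, *_ = np.linalg.lstsq(A.T, w, rcond=None)  # solve A^T coef ~= w
    return float(b @ coef)

def nrmse(true, est):
    """Normalized root mean square error over the imputed entries."""
    true, est = np.asarray(true, float), np.asarray(est, float)
    return float(np.sqrt(np.mean((true - est) ** 2)) / np.std(true))

def pearson_corr(true, est):
    """Pearson correlation (PC) between true and imputed values."""
    return float(np.corrcoef(true, est)[0, 1])
```

In the paper, the number of neighbors is chosen per DRF similarity measure, and the measure whose imputations give the lowest NRMSE and highest PC is preferred for a given dataset; a sketch like the one above would be run once per candidate neighbor count to reproduce that comparison.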