Voting-based instance selection from large data sets with MapReduce and random weight networks

Instance selection is an important preprocessing step in machine learning. By choosing a subset of a data set, it aims to let a learning algorithm reach performance comparable to training on the whole data set, and it makes learning algorithms feasible and effective on large data sets. This paper proposes a voting-based instance selection algorithm for large data sets that uses MapReduce and random weight networks (RWNs). First, the algorithm uses the Map phase of MapReduce to partition the large data set into small subsets and deploys them to different cloud computing nodes. Second, informative instances are selected from each subset in parallel with an instance selection algorithm. Third, the Reduce phase of MapReduce collects the selected instances from the different nodes to form a selected instance subset. These three steps are repeated p times (p is a user-defined parameter), yielding p instance subsets. Finally, a voting method selects the most informative instances from the p subsets. An RWN classifier is then trained on the final selected subset, and its accuracy is measured on a separate testing set. The proposed algorithm is compared experimentally with three state-of-the-art approaches: the condensed nearest neighbor (CNN), edited nearest neighbor (ENN), and reduced nearest neighbor (RNN) rules. The experimental results show that the proposed algorithm is effective and efficient.
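
The partition, select, collect, and vote steps described above can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-partition selector is taken to be Hart's condensed nearest neighbor rule (the abstract does not name the selector it uses), the voting rule is assumed to be a simple vote-count threshold, plain Python loops stand in for the Map and Reduce phases, and the names `condensed_nn`, `one_round`, `vote_select`, `n_parts`, and `threshold` are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def condensed_nn(data, part):
    """Assumed per-partition selector: Hart's condensed nearest neighbor rule.

    `data` is a list of (features, label) pairs; `part` is a non-empty list
    of indices into `data`. Returns the indices kept in the condensed store.
    """
    store = [part[0]]
    changed = True
    while changed:
        changed = False
        for i in part:
            if i in store:
                continue
            nearest = min(store, key=lambda j: sq_dist(data[j][0], data[i][0]))
            if data[nearest][1] != data[i][1]:
                # Misclassified by the current store, so keep this instance.
                store.append(i)
                changed = True
    return store


def one_round(data, n_parts, rng):
    """One partition / select / collect pass over the data."""
    order = list(range(len(data)))
    rng.shuffle(order)
    # Stand-in for Map: split the shuffled indices into n_parts subsets.
    parts = [order[k::n_parts] for k in range(n_parts)]
    selected = set()
    for part in parts:  # in a real deployment each part runs on its own node
        if part:
            selected.update(condensed_nn(data, part))
    # Stand-in for Reduce: the union of the per-partition selections.
    return selected


def vote_select(data, n_parts, p, threshold, seed=0):
    """Repeat the pass p times and keep instances with at least `threshold` votes.

    The threshold rule is an assumption; the abstract only says that voting
    picks the most informative instances from the p subsets.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(p):
        votes.update(one_round(data, n_parts, rng))
    return sorted(i for i, v in votes.items() if v >= threshold)
```

Calling `vote_select(data, n_parts=4, p=10, threshold=5)` on a list of `(features, label)` pairs would return the indices of instances selected in at least half of the ten rounds. Reshuffling before each round is what makes the p subsets differ, so the vote count reflects how consistently an instance is picked across different partitions.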
