A scalable and effective rough set theory-based approach for big data pre-processing

A big challenge in the knowledge discovery process is to perform data pre-processing, specifically feature selection, on data sets that are both large and high-dimensional. A variety of techniques have been proposed in the literature to deal with this challenge, with varying degrees of success, because most of them need further information about the input data for thresholding, require noise levels to be specified, or rely on feature ranking procedures. To overcome these limitations, rough set theory (RST) can be used to discover the dependencies within the data and to reduce the number of attributes in an input data set, using the data alone and requiring no supplementary information. However, RST is computationally expensive and reaches its limits on massive data sets. In this paper, we propose a scalable and effective rough set theory-based approach for large-scale data pre-processing, specifically for feature selection, under the Spark framework. Our detailed experiments, which consider data sets with up to 10,000 attributes, show that the proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance, which makes it relevant to big data.
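To make the idea of RST-based feature selection concrete, the following is a minimal single-machine sketch in Python. It is not the distributed Spark algorithm proposed in the paper. It assumes the standard rough set dependency degree (the fraction of objects whose condition-attribute values determine the decision unambiguously) and a simple greedy forward search. The function names, the dictionary-based row format, and the toy data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def dependency_degree(rows, cond_attrs, decision_attr):
    """Fraction of rows in the positive region of the decision attribute.

    Rows are grouped into equivalence classes by their values on cond_attrs;
    a class counts toward the positive region only if every row in it has
    the same decision value.
    """
    if not rows:
        return 0.0
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in cond_attrs)
        classes[key].append(row[decision_attr])
    positive = sum(len(ds) for ds in classes.values() if len(set(ds)) == 1)
    return positive / len(rows)


def greedy_reduct(rows, cond_attrs, decision_attr):
    """Greedy forward selection (illustrative assumption, not the paper's method).

    Repeatedly adds the attribute that raises the dependency degree the most,
    stopping once the subset matches the dependency of the full attribute set
    or no single attribute improves it.
    """
    target = dependency_degree(rows, cond_attrs, decision_attr)
    selected = []
    current = dependency_degree(rows, selected, decision_attr)
    while current < target:
        best_attr, best_gamma = None, current
        for attr in cond_attrs:
            if attr in selected:
                continue
            gamma = dependency_degree(rows, selected + [attr], decision_attr)
            if gamma > best_gamma:
                best_attr, best_gamma = attr, gamma
        if best_attr is None:
            # No single attribute helps; the greedy search stalls here.
            break
        selected.append(best_attr)
        current = best_gamma
    return selected


# Hypothetical toy decision table: attribute "b" alone determines "d".
toy_rows = [
    {"a": 0, "b": 0, "c": 1, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
    {"a": 1, "b": 0, "c": 0, "d": "no"},
    {"a": 1, "b": 1, "c": 0, "d": "yes"},
]
selected_attrs = greedy_reduct(toy_rows, ["a", "b", "c"], "d")
```

Each call to `dependency_degree` scans every row, and the greedy search calls it once per candidate attribute per round, so the cost grows quickly with both the number of rows and the number of attributes. This is the kind of expense that motivates distributing the computation.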
