Semi-random partitioning of data into training and test sets in granular computing context

Due to the vast and rapid growth in the size of data, machine learning has become an increasingly popular approach to knowledge discovery and predictive modelling. For both purposes, it is essential to partition a data set into a training set and a test set: the training set is used to learn a model, and the test set is then used to evaluate the performance of the model learned from the training set. However, the influence of this split on model performance has been investigated only with respect to the optimal proportion between the two sets, with no attention paid to the characteristics of the data within them. The current practice is therefore to split the data randomly, with approximately 70% used for training and 30% for testing. In this paper, we show that this way of partitioning the data leads to two major issues: (a) class imbalance and (b) poor sample representativeness. Class imbalance is known to affect the performance of many classifiers by introducing a bias towards the majority class, while an unrepresentative training set harms a model's performance by denying the learning algorithm relevant examples, much like testing a student on material that was never taught. To address these two issues, we propose a semi-random data partitioning framework in the setting of granular computing. While we discuss how the framework can address both issues, this paper focuses on avoiding class imbalance when partitioning the data through the proposed approach. The results show that avoiding class imbalance leads to better model performance.
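The abstract does not spell out the partitioning algorithm, but the general idea, treating each class as a granule and splitting it randomly while preserving the 70/30 proportion per class, can be sketched as follows. This is an illustrative, NumPy-only sketch, not the authors' code; the function name semi_random_split and its parameters are assumptions made for the example.

```python
# Illustrative sketch (not the paper's exact algorithm): a semi-random
# 70/30 split that avoids class imbalance by treating each class as a
# granule, shuffling within it, and splitting each granule separately.
import numpy as np

def semi_random_split(X, y, train_fraction=0.7, seed=0):
    """Split (X, y) so that each class keeps roughly the same proportion
    in the training and test sets (randomised within each class)."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)   # indices of this class granule
        rng.shuffle(idx)                 # random order within the granule
        cut = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Small demonstration on an imbalanced two-class data set:
# the minority share (10%) is preserved in both partitions.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_tr, X_te, y_tr, y_te = semi_random_split(X, y)
print((y_tr == 1).mean(), (y_te == 1).mean())  # ~0.1 and ~0.1
```

A purely random 70/30 split, by contrast, can leave the minority class under-represented (or entirely absent) in either partition, which is the class-imbalance issue the abstract describes.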
