Improving learning accuracy by using synthetic samples for small datasets with non-linear attribute dependency

Small-data problems are common in the early stages of a new manufacturing procedure and challenge both academics and practitioners, because learning models rarely perform well when data are insufficient. Virtual sample generation (VSG) has been shown to be an effective way to overcome this issue in a wide range of studies in various fields. Such works usually assume that the attributes are independent of one another, and produce synthetic data from the sample distribution of each attribute. However, VSG may be ineffective if the real data have interrelated attributes. Therefore, this research provides a novel procedure to generate related virtual samples with non-linear attribute dependency. To construct a relational model between the independent and dependent attributes, we employ gene expression programming (GEP) to find the most suitable mathematical model. One practical dataset and three real UCI datasets are used to verify the effectiveness of the proposed method. The results show that, with a back-propagation neural network (BPN) as the learning model, the proposed approach achieves better learning accuracy than the well-known mega-trend-diffusion (MTD) and multiple regression analysis (MRA) approaches.
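The idea of generating virtual samples that preserve a non-linear relation between attributes can be illustrated with a minimal sketch. It is not the procedure of this paper: a least-squares polynomial fit stands in for the GEP model search, the independent attribute is drawn uniformly from a slightly widened range instead of an MTD-style estimate, and the function name `generate_virtual_samples`, its parameters (`degree`, `margin`, `seed`) and the toy data are all assumptions made for illustration.

```python
import numpy as np


def generate_virtual_samples(x, y, n_virtual, degree=2, margin=0.1, seed=0):
    """Draw virtual (x, y) pairs that follow a fitted non-linear x -> y relation.

    Hypothetical helper: a polynomial fit replaces the GEP search described
    in the abstract, and the sampling range is a simple widened interval.
    """
    rng = np.random.default_rng(seed)

    # Relational model between the independent and dependent attribute
    # (stand-in for the expression that GEP would evolve).
    coeffs = np.polyfit(x, y, degree)

    # Spread of the observed points around the fitted relation.
    residual_std = np.std(y - np.polyval(coeffs, x))

    # Widen the observed range a little, since a small sample rarely
    # covers the full domain of the attribute.
    span = x.max() - x.min()
    low, high = x.min() - margin * span, x.max() + margin * span

    # Sample the independent attribute, then derive the dependent one
    # from the model so the two stay related.
    x_virtual = rng.uniform(low, high, n_virtual)
    noise = rng.normal(0.0, residual_std, n_virtual)
    y_virtual = np.polyval(coeffs, x_virtual) + noise
    return x_virtual, y_virtual


# Toy small dataset with a quadratic dependency (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x**2 + np.array([0.1, -0.2, 0.05, 0.0, -0.1, 0.15])

xv, yv = generate_virtual_samples(x, y, n_virtual=50)
```

Sampling each attribute independently would scatter the virtual points across the whole rectangle of attribute ranges, whereas deriving the dependent attribute from the fitted model keeps them near the observed curve, which is the property the abstract argues for.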
