Unbalanced data processing using oversampling: Machine Learning

Abstract: Deep learning (DL) algorithms currently perform well on problems that share characteristics such as large data volumes and high dimensionality. One of the main open challenges, however, is the classification of databases that combine high dimensionality with very few samples and severe class imbalance. Biomedical gene expression microarray databases exhibit exactly these characteristics. Class imbalance arises when the samples belonging to one class far outnumber those of the other class or classes, and it has been identified as one of the main challenges for algorithms applied in the Big Data context. The objective of this research is to study gene expression databases using conventional undersampling and oversampling methods for class balancing: random undersampling (RUS), random oversampling (ROS), and the Synthetic Minority Over-sampling Technique (SMOTE). The databases were modified in two ways: in one case by increasing their class imbalance, and in the other by adding artificial noise.
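
To make the three balancing methods named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the implementation used in this research: the function names (`random_oversample`, `random_undersample`, `smote_like`), the toy two-feature data, the seeds, and the neighbour count `k` are all assumptions chosen for illustration, and the SMOTE-style function is a simplified variant that only interpolates between a minority sample and one of its nearest minority neighbours.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS: duplicate random samples of each smaller class until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, minority, n_new, k=2, seed=0):
    """SMOTE-style sketch: interpolate between a minority sample and a near minority neighbour."""
    rng = random.Random(seed)
    pool = [x for x, lab in zip(X, y) if lab == minority]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(pool)
        # k nearest minority neighbours of `base` by squared Euclidean distance
        neighbours = sorted(
            (p for p in pool if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 8 majority samples (class 0) and 3 minority samples (class 1).
X = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
    [0.2, 0.4], [0.4, 0.1], [0.1, 0.2], [0.3, 0.3],
    [2.0, 2.1], [2.2, 1.9], [1.9, 2.3],
]
y = [0] * 8 + [1] * 3

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
synthetic = smote_like(X, y, minority=1, n_new=5)

print("original:", Counter(y))      # 8 samples of class 0, 3 of class 1
print("after ROS:", Counter(y_ros))  # both classes have 8 samples
print("after RUS:", Counter(y_rus))  # both classes have 3 samples
print("synthetic minority samples:", len(synthetic))
```

The sketch highlights the trade-off between the methods: ROS adds exact duplicates of minority samples, RUS discards majority samples (which is costly when a database has very few samples to begin with), and the SMOTE-style approach creates new points that lie between existing minority samples rather than copying them.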