A New Way of Handling Missing Data in Multi-source Classification Based on Adaptive Imputation

Data fusion is a useful methodology for improving classification performance. It consists of combining data acquired from multiple sources to obtain more informative evidence and support better decision making. Fusion is a challenging task, and the main difficulties arise from the data to be fused. Missing data is one such difficulty: its presence degrades the performance of learning algorithms and can lead to misleading predictions. Appropriately handling missing data is therefore crucial for accurate inference. Several approaches have been proposed in the literature to deal with multi-source classification problems; however, they neglect missingness and assume that the data are complete, which is rarely the case in practice. Other approaches apply simple data imputation directly before the learning process, which is not always enough to obtain a reliable learning and prediction model. In this paper, we propose a new approach to deal with missing data in multi-source classification problems. In our approach, we avoid direct imputation when the concerned feature is not important, and we also adjust the prediction fusion process according to the missing data rate in each data source and in the new instance to classify. The approach is used with Random Forests as the ensemble classifier, and it has shown improved classification performance compared to existing approaches.
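
The two ideas above, importance-gated imputation and fusion adjusted by missing data rates, can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the function names, the importance threshold, and in particular the weight (1 - source rate) * (1 - instance rate) are hypothetical and are not the method's actual definitions. The per-source class probabilities are assumed to come from one classifier per source (for example a Random Forest), which is not shown.

```python
from typing import Dict, List, Optional, Sequence


def missing_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are missing (encoded here as None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def features_to_impute(importances: Sequence[float], threshold: float) -> List[int]:
    """Indices of features considered important enough to impute.

    Hypothetical rule: features whose importance falls below the
    threshold are left un-imputed.
    """
    return [i for i, imp in enumerate(importances) if imp >= threshold]


def fuse_predictions(
    probas: Sequence[Dict[str, float]],
    source_rates: Sequence[float],
    instance_rates: Sequence[float],
) -> Dict[str, float]:
    """Weighted average of per-source class probabilities.

    Assumed weighting (not taken from the paper): a source counts less
    when its training data, or its part of the new instance, has a
    higher missing data rate.
    """
    weights = [(1.0 - s) * (1.0 - i) for s, i in zip(source_rates, instance_rates)]
    total = sum(weights)
    if total == 0.0:
        # Every source is fully missing: fall back to a plain average.
        weights = [1.0] * len(probas)
        total = float(len(probas))
    classes = probas[0].keys()
    return {c: sum(w * p[c] for w, p in zip(weights, probas)) / total for c in classes}
```

With two sources predicting `{"a": 0.9, "b": 0.1}` and `{"a": 0.2, "b": 0.8}`, source rates `[0.0, 0.5]` and instance rates `[0.0, 0.5]`, the second source receives a weight of 0.25 against 1.0 for the first, so the fused prediction leans toward the source with complete data.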