Effect of fusing features from multiple DCNN architectures in image classification

Automatic image classification has become necessary for handling the rapidly growing volume of digital images. It has given rise to many algorithms and continues to adopt new techniques. Among these, feature fusion-based image classification methods have traditionally relied on hand-crafted features. However, bottleneck features extracted from pre-trained convolutional neural networks (CNNs) have been shown to improve classification accuracy. Hence, this study analyses the effect of fusing such features from multiple architectures without relying on any hand-crafted features. First, CNN features are extracted from three different pre-trained models, namely AlexNet, VGG-16, and Inception-V3. Then, a generalised feature space is formed by employing principal component reconstruction and energy-level normalisation, where the features from each CNN are mapped into a common subspace and combined using arithmetic rules to construct fused feature vectors (FFVs). This transformation plays a vital role in creating an appearance-invariant representation by capturing the complementary information of different high-level features. Finally, a multi-class linear support vector machine is trained on the FFVs. The experimental results demonstrate that such multi-modal CNN feature fusion is well suited to image/object classification tasks, yet, surprisingly, it has not been explored extensively by the computer vision research community.
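The fusion step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random matrices stand in for the extracted CNN features, the feature dimensions (4096, 4096, 2048) are assumed, "principal component reconstruction" is read here as projection onto the top-k principal components, "energy-level normalisation" is read as scaling each vector to unit L2 norm, and the function names `pca_project`, `energy_normalise`, and `fuse` are hypothetical.

```python
import numpy as np


def pca_project(features, k):
    """Project the rows of `features` onto their top-k principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # SVD of the centered data; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


def energy_normalise(features, eps=1e-12):
    """Scale each row to unit L2 norm (an assumed reading of the normalisation)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)


def fuse(feature_sets, k, rule="sum"):
    """Map each feature set to a common k-dimensional space and combine them."""
    mapped = np.stack([energy_normalise(pca_project(f, k)) for f in feature_sets])
    if rule == "sum":
        return mapped.sum(axis=0)
    if rule == "product":
        return mapped.prod(axis=0)
    if rule == "max":
        return mapped.max(axis=0)
    raise ValueError(f"unknown fusion rule: {rule}")


# Synthetic stand-ins for features from the three pre-trained models.
rng = np.random.default_rng(0)
n_images = 20
alexnet = rng.normal(size=(n_images, 4096))    # assumed dimensionality
vgg16 = rng.normal(size=(n_images, 4096))      # assumed dimensionality
inception = rng.normal(size=(n_images, 2048))  # assumed dimensionality

ffv = fuse([alexnet, vgg16, inception], k=8, rule="sum")
print(ffv.shape)  # (20, 8)
```

In a full pipeline, the resulting fused feature vectors would be passed to a multi-class linear classifier such as a linear support vector machine. The choice of k and of the arithmetic rule here is purely illustrative.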