Fusion of multimodal imaging data with nonimaging data is critically important for a more complete understanding of disease characteristics and is therefore essential to accurate computer-aided diagnosis. However, there are two major challenges. 1) Effective discovery of discriminative multimodal information during the fusion process is hindered by the large dimension gap between raw medical images and clinical factors. 2) Interpreting the complex nonlinear cross-modal associations, especially in deep-network-based fusion models, is essential for uncovering disease mechanisms yet remains an unsolved challenge. To address these two challenges, we propose an Interpretable Deep Multimodal Fusion (DMFusion) Framework based on Deep Canonical Correlation Analysis (CCA). Specifically, a novel DMFusion loss is proposed to optimize the discovery of discriminative multimodal representations in a low-dimensional latent fusion space. This is achieved by jointly exploiting the inter-modal correlational association via a CCA loss and the intra-modal structural and discriminative information via a reconstruction loss and a cross-entropy loss. To interpret the nonlinear cross-modal association in the DMFusion network, we propose a cross-modal association (CA) score that quantifies the importance of input features toward the correlated association by harnessing integrated gradients in deep networks and canonical loadings in the CCA projection. The proposed fusion framework was validated on the differential diagnosis of demyelinating diseases of the Central Nervous System (CNS) and outperformed six state-of-the-art methods on three fusion tasks.
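
To make the composition concrete, a minimal sketch of how such a composite objective and CA score might be written is given below, assuming a weighted-sum formulation with trade-off hyperparameters $\lambda_{1}, \lambda_{2}$ and illustrative notation $g_{k}$ (the $k$-th canonical projection), $a_{k}$ (its canonical loading), and $\mathrm{IG}_{i}$ (the integrated-gradients attribution of input feature $x_{i}$); these symbols, weights, and the exact combination rule are assumptions for illustration, not definitions given in this abstract:
\[
\mathcal{L}_{\mathrm{DMFusion}}
  \;=\; \mathcal{L}_{\mathrm{CCA}}
  \;+\; \lambda_{1}\,\mathcal{L}_{\mathrm{recon}}
  \;+\; \lambda_{2}\,\mathcal{L}_{\mathrm{CE}},
\qquad
\mathrm{CA}_{i}
  \;=\; \sum_{k} \lvert a_{k} \rvert\,
  \bigl\lvert \mathrm{IG}_{i}\!\bigl(g_{k}; x\bigr) \bigr\rvert .
\]
Under this reading, the first expression balances the inter-modal CCA term against the intra-modal reconstruction and cross-entropy terms, and the second aggregates per-feature attributions toward each canonical direction, weighted by how strongly that direction contributes to the learned cross-modal correlation.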