Abstract This study proposes a novel deep learning approach for fusing 2D and 3D modalities in in-the-wild facial expression recognition (FER). Unlike previous studies, we exploit 3D facial information for in-the-wild FER. Because in-the-wild 3D FER datasets are not widely available, 3D facial data are reconstructed from existing 2D datasets, leveraging recent advances in 3D face reconstruction. Deep features are then extracted from the 3D facial geometry to capture mid-level details that carry meaningful expression cues for recognition. In addition, to demonstrate the potential of 3D data for FER, 2D projected images of the reconstructed 3D faces are taken as an additional input. These features are jointly fused with the 2D features extracted from the original input, and the fused features are classified by support vector machines (SVMs). The results show that the proposed approach achieves state-of-the-art recognition performance on the Real-World Affective Faces (RAF), Static Facial Expressions in the Wild (SFEW 2.0), and AffectNet datasets. The approach is also applied to a 3D FER dataset, i.e., BU-3DFE, to compare the effectiveness of reconstructed and ground-truth 3D face data for FER. To the best of our knowledge, this is the first deep learning combination of 3D and 2D facial modalities presented in the context of in-the-wild FER.