Automatic music tagging with Harmonic CNN

Feature design was one of the main focuses in the early stages of music informatics research (MIR), where hand-crafted features were fed to machine learning models to, e.g., bridge the semantic gap [2] between signal-level features and high-level music semantics. However, with the emergence of deep learning, recent MIR models can learn feature representations in an end-to-end, data-driven way. Hence, minimal domain knowledge is required, and only in the preprocessing step (e.g., the short-time Fourier transform). Recent works, such as the sample-level CNN [6], go further and use raw audio waveforms directly as their inputs. With no domain knowledge in its architecture design or preprocessing, the sample-level CNN yielded state-of-the-art results in music tagging [4].
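To make the raw-waveform approach concrete, the following is a minimal NumPy sketch of the repeating unit of a sample-level CNN in the style of [6]: a 1-D convolution with very small filters applied directly to audio samples, followed by ReLU and max pooling. All sizes (filter length 3, stride 3, channel counts, input length) are illustrative assumptions, not the exact configuration of any published model.

```python
import numpy as np

def conv1d(x, w, stride):
    """Valid 1-D convolution (cross-correlation) with stride.
    x: (in_ch, length), w: (out_ch, in_ch, kernel)."""
    out_ch, in_ch, k = w.shape
    n = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_ch, n))
    for i in range(n):
        seg = x[:, i * stride : i * stride + k]            # (in_ch, k)
        out[:, i] = np.tensordot(w, seg, axes=([1, 2], [0, 1]))
    return out

def sample_level_block(x, w, pool=3):
    """Conv -> ReLU -> max pool: the repeating unit of a sample-level CNN."""
    h = np.maximum(conv1d(x, w, stride=1), 0.0)            # ReLU
    n = h.shape[1] // pool
    return h[:, : n * pool].reshape(h.shape[0], n, pool).max(axis=2)

rng = np.random.default_rng(0)
waveform = rng.standard_normal((1, 300))                   # mono raw audio, 300 samples

# First strided layer: length-3 filters, stride 3, applied to raw samples
w0 = rng.standard_normal((16, 1, 3)) * 0.1
h = conv1d(waveform, w0, stride=3)                         # -> (16, 100)

# One intermediate block; a real model stacks many of these, then a classifier
w1 = rng.standard_normal((32, 16, 3)) * 0.1
h = sample_level_block(h, w1, pool=3)                      # -> (32, 32)
print(h.shape)
```

The point of the stack of tiny filters is that no hand-designed time-frequency transform appears anywhere: the first layer learns its own filterbank from the samples, which is exactly the "no domain knowledge in preprocessing" property the paragraph above describes.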