A generic classification system for multi-channel audio indexing: Application to speech and music detection

The growing number of 3D audio-visual productions and archives creates a need for indexing 3D content. Event detection using the audio modality is a difficult task. The standard approach to classifying 3D audio is to first downmix it to mono and then classify the mono signal. In this paper, we describe a generic classifier for multi-channel audio event detection and propose several information fusion strategies. We evaluate our system on a speech and music detection task using the audio of 3D movies. Compared with the standard downmixing method, we improve classification performance on our database by 1.5% for speech detection and 8% for music detection. We also compare several information fusion methods experimentally.
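To make the contrast between the downmixing baseline and a fusion strategy concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the system described in this paper: the equal-weight downmix, the score-averaging rule (one simple form of late fusion), the 0.5 threshold, and the names `downmix_to_mono`, `classify_downmix`, `classify_late_fusion`, and `toy_energy_scorer` are all hypothetical. The toy scorer merely stands in for a trained speech or music classifier.

```python
from statistics import fmean
from typing import Callable, List, Sequence

Channel = Sequence[float]            # samples of one audio channel
Scorer = Callable[[Channel], float]  # hypothetical: returns a score in [0, 1]


def downmix_to_mono(channels: Sequence[Channel]) -> List[float]:
    """Average the channels sample by sample (equal-weight downmix, an assumption)."""
    return [fmean(frame) for frame in zip(*channels)]


def classify_downmix(channels: Sequence[Channel], scorer: Scorer,
                     threshold: float = 0.5) -> bool:
    """Baseline: downmix first, then classify the single mono signal."""
    return scorer(downmix_to_mono(channels)) >= threshold


def classify_late_fusion(channels: Sequence[Channel], scorer: Scorer,
                         threshold: float = 0.5) -> bool:
    """Late fusion (assumed rule): score each channel separately, average the scores."""
    return fmean(scorer(ch) for ch in channels) >= threshold


def toy_energy_scorer(signal: Channel) -> float:
    """Toy stand-in for a trained classifier: mean energy clipped to [0, 1]."""
    return min(1.0, fmean(x * x for x in signal))
```

As a usage example, two channels carrying the same signal in opposite phase, such as `[[1.0, -1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, 1.0]]`, cancel out in the downmix, so `classify_downmix` sees silence while `classify_late_fusion` still sees energy in each channel. This is only one way in which per-channel information can be lost by downmixing.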