Indexing audiovisual databases through joint audio and video processing

This work deals with the representation of audiovisual information, organizing its content for subsequent tasks such as retrieval and information browsing. Evidence is provided that a cross-modal analysis of simple visual and audio information is sufficient to organize an audiovisual sequence into semantically meaningful segments. Each segment defines a scene that is coherent from some semantic point of view. Depending on the sophistication of the cross-modal analysis, the scene may represent either a generic story unit or more complex situations such as dialogues or actions. The results shown in this work indicate that audio classification is key in establishing relationships among consecutive shots, allowing a scene-level description to be reached. A higher abstraction level can be reached when a correlation exists among nonconsecutive shots, defining what are called "video idioms." Accordingly, a generic audio model is proposed in which the audio track is a linear combination of four classes of audio signals. For semantic purposes, it is meaningful to select the classes so that they can serve any subsequent scene characterization. When several audio sources are combined simultaneously, it is assumed that only one is linked to the semantics of the scene, and that it corresponds to the dominant class of audio in terms of energy. The classes that identify each type of audio are selected to facilitate any decision related to a semantic characterization of the audiovisual information. The problem therefore reduces to a source separation task. The proposed scheme classifies the audio signal into four component types: speech, music, silence, and miscellaneous other sounds. Its performance is quite satisfactory (∼90%) and was tested extensively on various types of source material. Given a generic audiovisual sequence, video shots are merged according to this audio classification. Depending on the type of source material (broadcast news, commercials, documentaries, and movies), different types of scenes can be identified, e.g., a single advertisement in the case of commercials or a dialogue situation in a movie. The article describes experiments in these different domains. © 1998 John Wiley & Sons, Inc. Int J Imaging Syst Technol, 9, 320–331, 1998
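
As an illustration of the kind of processing the abstract describes, the following Python sketch labels audio frames as speech, music, silence, or other sounds, selects the dominant class by energy, and merges consecutive shots that share the same dominant class into a scene. The short-time energy and zero-crossing-rate features, the thresholds, and the helper names (short_time_features, classify_frames, dominant_class, merge_shots_into_scenes) are illustrative assumptions, not the classifier or shot-merging rule reported in the paper.

```python
# Illustrative sketch only: frame-level labeling into the four audio classes
# named in the abstract, dominant-class selection by energy, and a simplified
# reading of the shot-merging step. Not the authors' algorithm.
import numpy as np

SILENCE, SPEECH, MUSIC, OTHER = "silence", "speech", "music", "other"

def short_time_features(x, sr, frame_ms=20):
    """Split the signal into frames; return per-frame energy and zero-crossing rate."""
    n = max(1, int(sr * frame_ms / 1000))
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    energy = np.array([np.mean(f ** 2) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f)))) / 2 for f in frames])
    return energy, zcr

def classify_frames(energy, zcr, e_sil=1e-4, zcr_speech=0.15, zcr_music=0.06):
    """Toy decision rule with hypothetical thresholds, standing in for a trained classifier."""
    labels = []
    for e, z in zip(energy, zcr):
        if e < e_sil:
            labels.append(SILENCE)      # very low energy -> silence
        elif z > zcr_speech:
            labels.append(SPEECH)       # frequent voiced/unvoiced alternation -> high ZCR
        elif z < zcr_music:
            labels.append(MUSIC)        # sustained tonal content -> low ZCR (crude proxy)
        else:
            labels.append(OTHER)        # everything else -> miscellaneous sounds
    return labels

def dominant_class(energy, labels):
    """Return the class carrying the most energy, following the dominance criterion in the abstract."""
    totals = {}
    for e, lab in zip(energy, labels):
        totals[lab] = totals.get(lab, 0.0) + e
    return max(totals, key=totals.get)

def merge_shots_into_scenes(shot_labels):
    """Group consecutive shots whose dominant audio class agrees into one scene."""
    scenes = []
    for i, lab in enumerate(shot_labels):
        if scenes and scenes[-1]["class"] == lab:
            scenes[-1]["shots"].append(i)
        else:
            scenes.append({"class": lab, "shots": [i]})
    return scenes

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    x = 0.5 * np.sin(2 * np.pi * 220 * t)          # synthetic tone stands in for real audio
    energy, zcr = short_time_features(x, sr)
    labels = classify_frames(energy, zcr)
    print(dominant_class(energy, labels))          # -> "music" for this tonal signal
    print(merge_shots_into_scenes(["speech", "speech", "music", "music", "speech"]))
```

A real system would replace the threshold rules with a classifier trained on labeled audio and would compute the dominant class per shot before merging, but the control flow above mirrors the pipeline sketched in the abstract: classify the audio, pick the energy-dominant class, and merge shots into scenes accordingly.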