Multimodal multimedia metadata fusion

The past decade has witnessed phenomenal growth in both the volume and the variety of information. From a mostly textual world, we have moved to the widespread use of multimedia data such as videos, images, and bio-sequences. These changes have created a growing demand for improved data-mining techniques that can cope with the added complexity. In this dissertation, we focus on information fusion techniques for mapping multimedia data to semantics.

Multimedia data carry multimodal information in the forms of semantics, context, and content, where content can comprise visual, audio, and textual information. Our work first extracts multimodal features and identifies the individual modalities. Once the modalities have been identified, we quantify a similarity measure for each of them. Many distance measures for multimedia data are non-metric in nature, yielding similarity matrices that are not positive definite. Kernel machines, despite their spectacular results on diverse datasets, work only with positive semi-definite matrices. We therefore employ a spectrum transformation to generate a positive semi-definite kernel matrix. Once the individual modalities have been identified and their distance measures designed, we use super-kernel fusion and Bayesian inference learning to fuse the modalities in a query-dependent way.

We also study two important applications of multimodal multimedia data fusion: video event recognition and video structure analysis. Security concerns have spurred new research on detecting hazardous events in video. We present a framework for multi-camera video surveillance that consists of three phases: detection, representation, and recognition. The detection phase fuses spatio-temporal data from multiple cameras to extract motion data; the representation phase constructs content-rich descriptions of the motion events; and the recognition phase identifies suspicious events based on these data descriptors.

Detecting video shot boundaries provides the foundation for nearly all existing video analysis and segmentation algorithms. A shot transition takes place where the inter-frame difference is perceptually significant. We use the dynamic partial function as the inter-frame difference measure to detect perceptual discontinuity, and hence the boundary of a shot. Through theoretical analysis and extensive empirical studies, we show that our proposed approaches perform more effectively and efficiently than traditional methods.
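To make the spectrum-transformation step concrete, the sketch below shows one common variant, spectrum clipping, in which the negative eigenvalues of an indefinite similarity matrix are set to zero so that the result can serve as a positive semi-definite kernel matrix. This is a minimal sketch under that assumption; the function name and the toy matrix are illustrative, and clipping is only one possible realization of the spectrum transformation mentioned above.

```python
import numpy as np

def clip_spectrum(similarity):
    """Make a symmetric similarity matrix positive semi-definite by
    clipping its negative eigenvalues to zero (spectrum clipping).

    `similarity` is an n-by-n symmetric matrix built from a possibly
    non-metric distance measure; the returned matrix can be used as a
    kernel (Gram) matrix by a kernel machine such as an SVM.
    """
    # Symmetrize to guard against numerical asymmetry.
    sym = (similarity + similarity.T) / 2.0
    # Eigendecompose; eigh is appropriate for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(sym)
    # Clip: set negative eigenvalues to zero, keep the rest unchanged.
    eigvals_clipped = np.clip(eigvals, 0.0, None)
    # Reconstruct the matrix from the clipped spectrum.
    return eigvecs @ np.diag(eigvals_clipped) @ eigvecs.T

# Toy example: a small indefinite similarity matrix becomes PSD after clipping.
S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.8],
              [0.2, 0.8, 1.0]])
K = clip_spectrum(S)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```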
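The super-kernel fusion step can be pictured as a two-stage combiner: one kernel machine per modality, whose decision scores are then fused by a second-stage learner. The sketch below is a simplified, non-query-dependent rendition of that idea, assuming scikit-learn's SVC; the function name, the RBF second stage, and the use of training-set decision values as fusion features are assumptions rather than the dissertation's exact formulation, and the query-dependent weighting and Bayesian inference learning are not shown.

```python
import numpy as np
from sklearn.svm import SVC

def superkernel_fusion_train(modality_kernels, labels):
    """Two-stage fusion sketch (assumed, simplified formulation).

    modality_kernels: list of n-by-n PSD kernel matrices, one per modality
                      (e.g., produced by clip_spectrum above).
    labels: length-n array of binary class labels.
    """
    first_stage = []
    scores = []
    for K in modality_kernels:
        # One kernel machine per modality, trained on its own kernel matrix.
        clf = SVC(kernel="precomputed")
        clf.fit(K, labels)
        first_stage.append(clf)
        # Decision values on the training set become fusion features.
        scores.append(clf.decision_function(K))
    fusion_features = np.column_stack(scores)   # n-by-m score matrix
    # Second-stage learner fuses the per-modality scores.
    fuser = SVC(kernel="rbf")
    fuser.fit(fusion_features, labels)
    return first_stage, fuser
```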
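Finally, the dynamic partial function (DPF) used as the inter-frame difference measure can be sketched as a Minkowski-like distance computed over only the m smallest feature-wise differences, so that a few strongly differing features do not dominate the comparison. The parameter names and the simple thresholding detector below are illustrative assumptions, not the dissertation's exact detection procedure.

```python
import numpy as np

def dynamic_partial_distance(x, y, m, r=1.0):
    """Dynamic partial function (DPF) between two frame feature vectors.

    Unlike a full Minkowski distance, DPF aggregates only the m smallest
    feature-wise differences, discarding the largest ones; m and r are
    tunable parameters.
    """
    diffs = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    smallest_m = np.sort(diffs)[:m]          # keep the m smallest differences
    return np.sum(smallest_m ** r) ** (1.0 / r)

def detect_shot_boundaries(frame_features, m, threshold, r=1.0):
    """Flag a shot boundary wherever the DPF between consecutive frames
    exceeds a threshold (a deliberately simplified detector)."""
    boundaries = []
    for t in range(1, len(frame_features)):
        if dynamic_partial_distance(frame_features[t - 1],
                                    frame_features[t], m, r) > threshold:
            boundaries.append(t)
    return boundaries
```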