Introduction to the Special Section on Intelligent Multimedia Systems and Technology Part II

With the explosion of video and image data available on the Internet, desktops, and mobile devices, intelligent multimedia systems and technologies are becoming increasingly important. Extracting semantics and other useful information from multimedia data to facilitate online and local multimedia content analysis, search, and other related applications has also gained growing attention from both academia and industry. Machine learning and data mining have proven to be promising approaches in many data-intensive applications, and many efforts have also been dedicated to multimedia data.

As we mentioned in the first part of this special issue, which was published in February 2011, this special section contains the second group of articles on this subject. The objective of this special section is the same as that of the first part, that is, to bring together the latest research in intelligent multimedia systems and technologies. We seek effective machine-learning and data-mining algorithms, frameworks, systems, and implementations that work specifically on multimedia data (including image, video, and audio, which may also be associated with textual information). The focus is to identify real challenges in intelligent multimedia systems and technology and to investigate practical solutions to the core problems of multimedia applications from both theoretical and practical perspectives.

This section contains eight articles that are organized into three parts. The first part contains four articles on machine-learning techniques for multimedia content understanding and tracking. In the first article, Yang and Chen give a comprehensive review of the methods for music emotion learning and recognition, as well as discussions of open issues and future research directions. Ewerth et al. propose a transductive learning framework for robust video content analysis based on feature selection and ensemble classification, which applies to multiple tasks including shot boundary detection, face recognition, semantic video retrieval, and semantic indexing of computer game sequences. Suk et al. introduce a knowledge-based hybrid method for human motion recognition, which shows how a machine-learning algorithm can learn from one media type (3D motion capture) to better classify another related media type (2D video). Zhang et al. present an appearance-model-based visual tracking algorithm that simulates the sparse coding and feature-based visual attention mechanisms of the human visual system.

Part two consists of two articles on visual features for various multimedia tasks and applications. Ji et al. give a systematic exploration of context information in designing an interest point detector. The work integrates contextual cues to extend the interest point detector from the traditional local scale to a semi-local scale, which enables the discovery of more meaningful and discriminative salient regions without losing detector repeatability. Berretti et al. present a feature selection approach for 3D face recognition, which has been used to identify the facial features involved in the recognition of different ethnic groups and of faces with different expressions.

Part three contains two articles that address two different multimedia applications. Zhang et al. introduce a generic framework for analyzing a relatively large-scale, diverse sports video dataset with three video analysis tasks in a coherent and sequential order. Leung et al. develop an adaptive search engine architecture and a robust adaptive index update strategy for social media sharing, which enable the system to improve its performance over time.