Special section on learning from multiple evidences for large scale multimedia analysis

With the popularity of digital cameras and smartphones, an explosively growing amount of multimedia data is generated every day, and the size of personal and internet image/video collections keeps growing rapidly. In the Web 2.0 era, the success of social multimedia websites such as Facebook, Flickr, and YouTube provides a wealth of internet multimedia data. Even a personal digital archive may contain over ten thousand pictures and hundreds of hours of video. Effective and efficient multimedia data analysis, which substantially benefits multimedia data utilization and management at large scales, has therefore become one of the greatest research challenges in the community. The information obtained from multimedia data consists of multiple evidences; for example, internet images are usually accompanied by textual descriptions and social network metadata. Learning from such multiple evidences for large scale multimedia content analysis is an interesting research topic with a range of important applications, such as multimedia retrieval, multimedia event detection, concept detection, and indexing. For example, several recent papers have reported that combining metadata with low-level features benefits web image analysis. As another example, the 15-year Informedia project at Carnegie Mellon University has demonstrated that combining Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) with visual features usually yields higher multimedia event detection accuracy than using visual features alone. It is therefore a promising research direction to appropriately exploit multiple evidences derived from visual, auditory, and textual features as well as social metadata.

This special issue presents the latest research on combining multiple evidences for multimedia analysis. Among the 18 submissions, 5 were accepted for this special issue. Given an action specified by a user, Nga and Yanai propose a novel method to automatically retrieve video shots of that action from the Internet by jointly exploiting the metadata and visual features of web videos. Experiments on a large-scale dataset demonstrate that combining the two cues reduces the human labor required to build an action dataset, compared to the traditional, exhaustive manual approach.