Beyond Language and Vision, Towards Truly Multimedia Integration

Text has been the dominant medium for understanding the world around us. More recently, with the increasing amount of visual content without accompanying text, computer vision technology has been making waves in a number of visually oriented applications such as fashion, home furnishing, and product search, as well as rich media description. However, our world and our perception of our surroundings are multi-sensory and multimodal in nature, and there are many signals beyond text and vision that should be leveraged to offer a better understanding of our environments. Taking applications in mobility, wellness, and user profiling as examples, we need not only textual and visual content, but also location, acoustic, and sensory data, as well as various forms of structured knowledge about the environment. In practice, these data can come from many sources, including social media sites. However, such a fusion process poses many technical challenges, including the aggregation of data from different modalities and heterogeneous sources, and the modelling of the consistency and complementarity of information arising from these sources. This talk describes our current research on multimodal multi-task learning models that integrate a wide variety of information to tackle the application examples mentioned above. We further discuss issues of privacy in this line of research.
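To make the general idea of multimodal multi-task learning concrete, the sketch below shows one common pattern: modality-specific encoders project text, visual, and sensor features into a shared space, a fusion layer combines them, and task-specific heads (here, hypothetical wellness and user-profiling tasks) share that fused representation. This is a minimal illustration under assumed dimensions and task names, not the model described in the talk.

```python
# Minimal, illustrative sketch of a multimodal multi-task architecture.
# All dimensions, modality names, and tasks are hypothetical.
import torch
import torch.nn as nn


class MultimodalMultiTaskModel(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, sensor_dim=64,
                 fused_dim=256, num_wellness_classes=5, num_profile_classes=10):
        super().__init__()
        # One lightweight encoder per modality, projecting into a common space.
        self.text_enc = nn.Sequential(nn.Linear(text_dim, fused_dim), nn.ReLU())
        self.image_enc = nn.Sequential(nn.Linear(image_dim, fused_dim), nn.ReLU())
        self.sensor_enc = nn.Sequential(nn.Linear(sensor_dim, fused_dim), nn.ReLU())
        # Shared fusion layer over the concatenated modality embeddings.
        self.fusion = nn.Sequential(nn.Linear(3 * fused_dim, fused_dim), nn.ReLU())
        # Task-specific heads sharing the fused representation (multi-task learning).
        self.wellness_head = nn.Linear(fused_dim, num_wellness_classes)
        self.profile_head = nn.Linear(fused_dim, num_profile_classes)

    def forward(self, text_feat, image_feat, sensor_feat):
        fused = self.fusion(torch.cat([
            self.text_enc(text_feat),
            self.image_enc(image_feat),
            self.sensor_enc(sensor_feat),
        ], dim=-1))
        return self.wellness_head(fused), self.profile_head(fused)


if __name__ == "__main__":
    model = MultimodalMultiTaskModel()
    text = torch.randn(4, 768)     # e.g. pre-extracted text embeddings
    image = torch.randn(4, 2048)   # e.g. pre-extracted visual features
    sensor = torch.randn(4, 64)    # e.g. location/acoustic/sensor features
    wellness_logits, profile_logits = model(text, image, sensor)
    # Multi-task training would sum per-task losses on these two outputs.
    print(wellness_logits.shape, profile_logits.shape)
```

In this pattern, the shared fusion layer is where consistency and complementarity across modalities are modelled, while the separate heads let heterogeneous applications be trained jointly on the same fused representation.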