Multi-Task Deep Relative Attribute Learning for Visual Urban Perception

Visual urban perception aims to quantify perceptual attributes (e.g., how safe or depressing a place looks) of the physical urban environment from crowd-sourced street-view images and their pairwise comparisons. It has received growing attention in computer vision for applications such as perceptual attribute learning and urban scene understanding. Most existing methods adopt either 1) a regression model trained on image features and ranked scores converted from pairwise comparisons, or 2) a pairwise ranking algorithm that learns each perceptual attribute independently. However, the former fails to exploit pairwise comparisons directly, while the latter ignores the relationships among different attributes. To address these limitations, we propose a multi-task deep relative attribute learning network (MTDRALN) that learns all the relative attributes simultaneously via multi-task Siamese networks, where each Siamese network predicts one relative attribute. Combined with deep relative attribute learning, we use structured sparsity to exploit the prior from natural attribute grouping, where all the attributes are divided in advance into groups based on semantic relatedness. MTDRALN can therefore learn all the perceptual attributes jointly via multi-task learning. Besides the ranking sub-network, MTDRALN introduces a classification sub-network, and the losses from these two sub-networks jointly constrain the parameters of the deep network so that it learns more discriminative visual features for relative attribute learning. In addition, our network can be trained end-to-end, so that deep feature learning and multi-task relative attribute learning reinforce each other. Extensive experiments on the large-scale Place Pulse 2.0 dataset validate the advantage of the proposed network. Our qualitative results, along with visualizations of saliency maps, also show that the proposed network learns effective features for perceptual attributes.
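The way a ranking loss, a classification loss, and a structured-sparsity prior can be combined into one objective is sketched below in plain Python. This is only an illustration under stated assumptions, not the MTDRALN formulation: the abstract does not give the loss definitions, so the RankNet-style logistic ranking loss, the two-class softmax classification loss, the per-feature group norm, the weights `alpha` and `lam`, and every function name here are hypothetical choices. The attribute name "boring" and the grouping are likewise illustrative.

```python
import math

def ranking_loss(score_a, score_b, a_preferred):
    """Logistic (RankNet-style) loss on the score difference; an assumed form."""
    diff = score_a - score_b
    target = 1.0 if a_preferred else 0.0
    # Numerically stable binary cross-entropy on sigmoid(diff).
    return max(diff, 0.0) - diff * target + math.log1p(math.exp(-abs(diff)))

def classification_loss(logits, label):
    """Softmax cross-entropy over the two outcomes 'A wins' / 'B wins' (assumed)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def group_sparsity(weights, groups):
    """For each group and feature, add the L2 norm across the group's attributes.

    This encourages attributes in the same group to select the same features.
    """
    total = 0.0
    for group in groups:
        n_features = len(weights[group[0]])
        for d in range(n_features):
            total += math.sqrt(sum(weights[a][d] ** 2 for a in group))
    return total

def total_loss(pairs, weights, groups, alpha=1.0, lam=0.01):
    """Mean ranking + classification loss over pairs, plus the sparsity penalty."""
    data = 0.0
    for p in pairs:
        label = 0 if p["a_preferred"] else 1
        data += ranking_loss(p["score_a"], p["score_b"], p["a_preferred"])
        data += alpha * classification_loss(p["logits"], label)
    return data / len(pairs) + lam * group_sparsity(weights, groups)

# Illustrative inputs: per-attribute weights over three shared features,
# and a hypothetical semantic grouping of the attributes.
weights = {
    "safe": [0.5, 0.0, 0.2],
    "depressing": [0.0, 0.7, 0.1],
    "boring": [0.0, 0.6, 0.0],
}
groups = [["safe"], ["depressing", "boring"]]
pairs = [
    {"score_a": 1.2, "score_b": 0.3, "a_preferred": True, "logits": [0.8, -0.4]},
    {"score_a": -0.5, "score_b": 0.9, "a_preferred": False, "logits": [-0.2, 0.6]},
]
print(total_loss(pairs, weights, groups))
```

In a real network the scores and logits would come from the two sub-networks applied to an image pair, and the penalty would act on the task-specific layers; here they are fixed numbers so the arithmetic of the combined objective can be checked by hand.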
