In the past decade, computer vision has been transformed by the renewed adoption of deep learning for visual understanding tasks. Driven by the increasing availability of large annotated data sets, efficient training techniques, and faster computational platforms, deep-learning-based solutions have been progressively employed in a broader spectrum of applications, from image classification to activity recognition. Deep learning, in general, refers to a range of artificial neural networks that consist of multiple layers and are loosely inspired by the structure and cognitive processes of the human brain. Instead of relying on handcrafted features, these networks acquire knowledge directly from data. They regress intricate objective functions in a nested hierarchy, where more sophisticated representations with larger receptive fields are computed in terms of less abstract ones with localized supports. Deep learning also makes it possible to incorporate explicit domain knowledge and to replace a large variety of conventional algorithmic blocks with trainable, differentiable modules. Together, these properties give deep learning exceptional power and flexibility in modeling the relationship between the input data and the target output.
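As a rough illustration of how receptive fields grow through such a hierarchy, the following minimal sketch computes the receptive field size after each layer of a stack of convolutions. It assumes plain, undilated convolutions described only by kernel size and stride; the function name `receptive_fields` and the three-layer stack are hypothetical and chosen only for illustration.

```python
def receptive_fields(layers):
    """Return the receptive field size (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs for undilated
    convolutions, ordered from the input toward the output.
    """
    size = 1  # one input pixel covers only itself
    jump = 1  # distance, in input pixels, between adjacent units of the current layer
    sizes = []
    for kernel, stride in layers:
        # Each extra kernel tap widens the field by one "jump" of the layer below.
        size += (kernel - 1) * jump
        # Striding spreads the units of this layer further apart in input space.
        jump *= stride
        sizes.append(size)
    return sizes


# Hypothetical stack: three 3x3 convolutions with strides 1, 2, and 1.
stack = [(3, 1), (3, 2), (3, 1)]
print(receptive_fields(stack))  # prints [3, 5, 9]
```

Each layer uses only a small local kernel, yet deeper units depend on progressively larger regions of the input, which is the sense in which more abstract representations are built from less abstract, localized ones.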