Abstract. Automatic affect recognition in real-world environments is challenging because of the uncontrolled conditions in such environments. Most studies in the literature have focused on methods for laboratory settings and for categorical emotions. In recent years, however, the affective computing community has shifted towards continuous emotion recognition in naturalistic settings. In this chapter we aim to (i) highlight the differences between real-world and laboratory settings, (ii) describe emotions for audio- and video-based recognition, and (iii) present the current state of the affective computing community. Finally, we illustrate a multimodal (audiovisual) continuous emotion recognition system based on deep end-to-end learning and provide experimental results on the RECOLA database.