Long gone are the days when a video surveillance system could process only one video stream acquired by a single fixed camera. In those days, algorithms were tested in laboratory environments, with a small number of people moving in an orderly fashion and with limited clutter in the scene. Modern monitoring systems have much more demanding requirements: large, busy and complex scenes; the use of heterogeneous sensor networks; the real-time acquisition and interpretation of the evolving scene; and the instantaneous flagging of potentially critical situations in any weather and illumination conditions. Moreover, operators expect a real-time, natural-language description of how the scene evolves, covering any type of expected or unexpected event across a variety of situations, from an empty scene to groups of people and, in some cases, very crowded environments. The monitoring of public and private spaces has become a necessity, because of the steady increase in