Online video analysis for abnormal event detection and action recognition

Automatic video surveillance has become one of the most active research areas in computer vision. Its applications are vast and include security, patient monitoring and law enforcement. With millions of cameras operating all over the world, human surveillance is impractical for many reasons. Perhaps the most important is that, strictly speaking, one person is required to monitor each camera. Such monitoring is not only unrealistic but also inefficient, because a person cannot observe a scene 24/7. Even if that were possible, fatigue and distractions might reduce its effectiveness. The main challenge of video surveillance is that practical scenarios require online processing (processing with no cumulative delay). The reason is that the system must respond immediately after an event occurs. If this time requirement is not satisfied, the system will end up warning the operators minutes or hours later, and its response will be of little use for events where response times are critical (e.g. crimes, accidents and fires). Although many methods have been developed for video surveillance, very few of them operate online. The lack of online approaches stems from the trade-off between accuracy in detecting events and computational complexity. The objective of this thesis is to minimise the gap in the speed-accuracy trade-off. To this end, this thesis proposes: (I) multi-source motion extraction to boost accuracy and expand the types of events that can be detected, (II) the extraction of few but highly descriptive features via multi-scale extraction with perspective compensation, and (III) four fast binary-based video descriptors. The main findings of this thesis are as follows. First, multi-scaled perspective features reduce computational times, meeting online requirements in abnormal event detection.
Second, binary video features achieve competitive accuracy in action recognition compared with existing features while drastically outperforming them in terms of computational complexity. In conclusion, first, carefully selecting the spatio-temporal regions of the video data to process significantly improves accuracy and at the same time reduces the computational time needed to detect abnormal events. Second, binary video features can compete with existing features by selecting a limited number of descriptive, symmetric spatio-temporal regions. Finally, the findings of this thesis could benefit any video application that requires real-time or online processing.