Segmenting Egocentric Videos to Highlight Personal Locations of Interest

With the increasing availability of wearable cameras, the acquisition of egocentric videos is becoming common in many scenarios, including law enforcement, assistive technologies, and life-logging. However, the absence of explicit structure in such videos (e.g., video chapters) makes their exploitation difficult. Depending on the considered goal, long egocentric videos tend to contain much uninformative content, such as transiting through a corridor, walking, or driving to the office. Therefore, automated tools are needed to enable faster access to the information stored in such videos and to index their visual content. In this direction, researchers have investigated methods to produce short informative video summaries, recognize the actions performed by the wearer, and segment videos according to detected ego-motion patterns.

While the current literature focuses on providing general-purpose methods, which are usually optimized on data acquired by many users, we argue that, given the subjective nature of egocentric videos, more attention should be devoted to user-specific methods. More specifically, we propose to segment unstructured egocentric videos into coherent shots related to user-specified personal locations of interest. We consider a personal location to be a fixed, distinguishable spatial environment in which the user can perform one or more activities, which may or may not be specific to the considered location. According to this notion, a personal location is specified at the instance level (e.g., my kitchen, my office, my car) rather than at the category level (e.g., a kitchen, an office, a car).

Given a set of personal locations of interest to be considered for the segmentation of egocentric sequences, the task is to determine, for every frame of the video, whether it is related to one of the considered personal locations of interest or to none of them (in which case it is referred to as a negative sample to be rejected). Figure 1 shows a schema of the investigated problem: given an input video and minimal user-specified training data (i.e., short video clips of the personal locations of interest), the system should be able to segment the video, highlighting the presence of the considered locations of interest as well as rejecting the negative frames.

[Figure 1: schema of the investigated problem; an input video is segmented into personal locations of interest (car, office, c.v.m., car, kitchen) and negative frames.]
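To make the frame-level formulation concrete, the sketch below labels each frame with the most likely personal location of interest and rejects it as negative when the top score falls below a threshold, then groups consecutive frames into shots. This is a minimal illustration of the task, not the method proposed here: the score values, location names, and the reject_threshold parameter are hypothetical placeholders for the output of whatever per-frame classifier is used.

```python
import numpy as np

NEGATIVE = "negative"  # label for frames belonging to none of the locations of interest

def label_frames(frame_scores, location_names, reject_threshold=0.5):
    """Assign each frame a personal location of interest, or reject it as negative.

    frame_scores: (n_frames, n_locations) array of per-frame scores
                  (e.g., softmax probabilities from any frame classifier).
    Returns one label per frame.
    """
    labels = []
    for scores in frame_scores:
        best = int(np.argmax(scores))
        if scores[best] < reject_threshold:
            labels.append(NEGATIVE)            # negative sample: rejected
        else:
            labels.append(location_names[best])
    return labels

def frames_to_shots(labels):
    """Group consecutive frames sharing a label into (label, start_frame, end_frame) shots."""
    shots, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[i - 1]:
            shots.append((labels[start], start, i - 1))
            start = i
    return shots

# Toy usage with made-up scores for three locations of interest.
locations = ["car", "office", "kitchen"]
scores = np.array([
    [0.90, 0.05, 0.05],   # confidently "car"
    [0.85, 0.10, 0.05],   # "car"
    [0.40, 0.35, 0.25],   # ambiguous -> rejected as negative
    [0.10, 0.80, 0.10],   # "office"
    [0.05, 0.85, 0.10],   # "office"
])
print(frames_to_shots(label_frames(scores, locations)))
# [('car', 0, 1), ('negative', 2, 2), ('office', 3, 4)]
```

A complete system would of course couple such a decision rule with a classifier trained on the user-provided clips and with some temporal smoothing of the per-frame labels; the sketch only illustrates the per-frame decision with rejection.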