Unsupervised clustering of ambulatory audio and video

A truly personal and reactive computer system should have access to the same information as its user, including the ambient sights and sounds. To this end, we have developed a system for extracting events and scenes from natural audio/visual input. We find our system can (without any prior labeling of data) cluster the audio/visual data into events, such as passing through doors and crossing the street. Also, we hierarchically cluster these events into scenes and get clusters that correlate with visiting the supermarket, or walking down a busy street.