TubeKit: a query-based YouTube crawling toolkit

Recognizing the importance of capturing contextual information for digital video preservation, we proposed the ContextMiner model [Shah and Marchionini, 2007a] and demonstrated how to put this concept into practice [Shah and Marchionini, 2007b]. To extend this notion so that a digital curator can work with a dynamic collection, we decided to use US presidential election videos from YouTube [Shah and Marchionini, 2007c] to understand the role of contextual information in analyzing and explaining collection development issues for a digital library or archive. To obtain videos, metadata, and contextual information from YouTube, we built a crawler. The crawler is query-based: it runs a set of seed queries against YouTube search and, for each query, obtains the ranked list of the top 100 videos. It then collects a set of attributes for each video. Some attributes are static, such as title, description, and tags, and are treated as metadata; others are dynamic, such as views, comments, and ratings, and are treated as contextual information. We decided to collect the contextual information for the videos in our collection every day. The crawler has been running since May 2007. Since then, we have found several other topics for which we needed to collect similar data from YouTube. This need drove us to create a broader, more general framework for building query-based YouTube crawlers on any topic. We present TubeKit, a toolkit for creating YouTube crawlers. It allows users to build their own crawler that searches YouTube using a set of seed queries and collects up to 24 different attributes at regular intervals. TubeKit assists in all phases of this process, from creating the database to preparing analysis reports from the collected data.
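The crawling scheme described above can be illustrated with a minimal sketch. This is not TubeKit's actual code: the names `VideoRecord`, `crawl_once`, and `stub_search`, and the dictionary keys used for each search result, are hypothetical and chosen only for illustration. The search step is passed in as a function so that no particular YouTube API is assumed. The sketch shows the one distinction the text makes: static metadata is stored once per video, while dynamic contextual information is stored as a new snapshot on every crawl.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, Iterable, List

# Attribute names taken from the examples in the text; the real toolkit
# collects up to 24 attributes, which are not enumerated here.
STATIC_FIELDS = ("title", "description", "tags")      # metadata
DYNAMIC_FIELDS = ("views", "comments", "ratings")     # contextual information

# Hypothetical search interface: (query, limit) -> ranked list of result dicts.
SearchFn = Callable[[str, int], List[dict]]


@dataclass
class VideoRecord:
    video_id: str
    metadata: Dict[str, object]                        # captured once
    snapshots: Dict[date, Dict[str, object]] = field(default_factory=dict)


def crawl_once(
    queries: Iterable[str],
    search: SearchFn,
    collection: Dict[str, VideoRecord],
    today: date,
    top_n: int = 100,
) -> Dict[str, VideoRecord]:
    """Run every seed query and record one contextual snapshot per video."""
    for query in queries:
        for result in search(query, top_n)[:top_n]:
            video_id = result["id"]
            record = collection.get(video_id)
            if record is None:
                # First sighting: store the static metadata.
                record = VideoRecord(
                    video_id, {key: result[key] for key in STATIC_FIELDS}
                )
                collection[video_id] = record
            # Every crawl: store the dynamic values under today's date.
            record.snapshots[today] = {key: result[key] for key in DYNAMIC_FIELDS}
    return collection


def stub_search(query: str, limit: int) -> List[dict]:
    """Stand-in for a real YouTube search call; returns fabricated results."""
    return [
        {
            "id": f"{query}-{rank}",
            "title": f"Video {rank} for {query}",
            "description": "",
            "tags": "",
            "views": 0,
            "comments": 0,
            "ratings": 0,
        }
        for rank in range(min(limit, 3))
    ]
```

Calling `crawl_once` once per day with the same `collection` dictionary would accumulate a dated series of snapshots for each video, which is the kind of data a later analysis step could work from. In a real crawler, `stub_search` would be replaced by a function that queries YouTube, and the collection would live in a database rather than in memory.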