Video CAPTCHAs: Usability vs. Security

A Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is a variation of the Turing test, in which a challenge is used to distinguish humans from computers (‘bots’) on the internet. CAPTCHAs are commonly used to prevent the abuse of online services; for example, malicious users have written automated programs that sign up for thousands of free email accounts and send spam messages. A number of hard artificial intelligence problems, including natural language processing, speech recognition, character recognition, and image understanding, have been used as the basis for these tests, on the expectation that humans will outperform bots. The most common type of CAPTCHA requires a user to transcribe distorted characters displayed within a noisy image. Unfortunately, many users find existing character-recognition-based CAPTCHAs frustrating, and attack success rates as high as 60% have been reported for Microsoft’s Hotmail CAPTCHA [8]. To address these problems, we present a first attempt at using content-based video labeling (‘tagging’) as a CAPTCHA task. We define correct responses using tags provided by the individual who posts a video to a public database (YouTube.com), along with tags on videos designated as being ‘related’ in the database. In an experiment involving 184 human participants, we were able to increase human pass rates on our video CAPTCHAs from roughly 70% to 90% while keeping the success rate of a frequency-based attack fixed at around 13%. Through a different parameterization of the challenge generation and tag matching algorithms, we were able to reduce the success rate of the same attack to 2%, while still increasing the human pass rate to 75% [5].
The frequency-based attack we consider is simple but logical for this type of CAPTCHA: the computer submits the three tags with the highest estimated frequencies below the rejection threshold, on the assumption that the tag frequency estimates used in creating the CAPTCHAs are publicly available. A screenshot of our video-based CAPTCHA is shown in Figure 1. To pass the challenge, a user provides three words (‘tags’) describing the video. If at least one of the submitted tags belongs to the automatically generated ground truth tag set, the challenge is passed. This task is similar to the ESP game of von Ahn et al. [7], in which online users are randomly paired and presented with an image, which they then describe by submitting tags. Players cannot see each other’s submitted tags until they agree on a common tag, at which point the round of the game ends. Our video CAPTCHA is similar to a game of ESP in which one player is online, while the other player’s responses (the ground truth tags) are computed automatically.
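The pass rule and the frequency-based attack described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiment: the function names, the example tags, the frequency values, and the threshold are all hypothetical, and tag matching is assumed here to be a case-insensitive exact comparison, which may be simpler than the matching actually used.

```python
def passes_challenge(submitted_tags, ground_truth_tags):
    """Return True if at least one submitted tag is in the ground truth set.

    Assumption: matching is exact after trimming and lowercasing.
    """
    normalized = {tag.strip().lower() for tag in submitted_tags}
    return any(tag in ground_truth_tags for tag in normalized)


def frequency_attack_tags(tag_frequencies, rejection_threshold, k=3):
    """Pick the k most frequent tags whose estimated frequency is below
    the rejection threshold (the attacker is assumed to know the estimates).
    """
    allowed = [
        (tag, freq)
        for tag, freq in tag_frequencies.items()
        if freq < rejection_threshold
    ]
    # Sort by descending frequency; break ties alphabetically for determinism.
    allowed.sort(key=lambda pair: (-pair[1], pair[0]))
    return [tag for tag, _ in allowed[:k]]


# Hypothetical frequency estimates and ground truth, for illustration only.
example_frequencies = {
    "music": 0.020,
    "video": 0.015,
    "funny": 0.008,
    "live": 0.006,
    "dance": 0.004,
    "cat": 0.003,
}
example_ground_truth = {"cat", "kitten", "funny"}

attack_guess = frequency_attack_tags(example_frequencies, rejection_threshold=0.01)
attack_succeeds = passes_challenge(attack_guess, example_ground_truth)
human_succeeds = passes_challenge(["Kitten", "piano", "cute"], example_ground_truth)
```

In this toy data the attack happens to succeed because one of its three guesses is in the ground truth set. Lowering the rejection threshold excludes more of the common tags from the attacker's candidates, which is the kind of parameter change the reported attack success rates depend on.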