CAPTCHA Challenge Tradeoffs: Familiarity of Strings versus Degradation of Images

It is well documented that, for human readers, familiar text is more legible than unfamiliar text. Current-generation computer vision systems can also exploit some kinds of prior knowledge of linguistic context: for example, many OCR systems use known lexica (word lists, such as lists of commonly occurring English words) to disambiguate interpretations. Human readers, interestingly, can exploit varying degrees of familiarity: for example, strings of characters that, while not found in dictionaries, resemble dictionary words, such as "pronounceable" strings or strings made up of frequently occurring character n-grams. In contrast, computer vision technologies for exploiting such poorly characterized constraints (absent an explicit, complete lexicon) are not yet well developed. This gap in ability may allow us to design stronger CAPTCHAs. We measure the familiarity of challenge strings generated by four methods (described by Bentley and Mallows), and we use the ScatterType CAPTCHA to degrade challenge images. We report the results of a human legibility trial that supports the hypothesis that more familiar strings are indeed more legible in CAPTCHAs. Our measurements may enable engineering CAPTCHAs with a more uniform distribution of difficulty by balancing image degradations against familiarity.
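
As an illustration of what scoring a string's familiarity by character n-grams could look like, the following is a minimal sketch and not the measure used in this work. It assumes a tiny hypothetical word list (`WORDS`) in place of a real corpus, a character-bigram model with add-one smoothing, and hypothetical function names (`train_bigrams`, `familiarity`); the score is the average log-probability per bigram, so higher values indicate more word-like strings.

```python
import math
from collections import Counter

# Hypothetical toy word list; a real measurement would use a large corpus.
WORDS = ["the", "there", "then", "string", "strong",
         "reader", "read", "image", "legible", "text"]


def train_bigrams(words):
    """Count character bigrams, padding each word with start/end markers."""
    counts = Counter()   # (previous char, next char) -> occurrences
    context = Counter()  # previous char -> occurrences as a bigram prefix
    for word in words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context


def familiarity(s, counts, context, alphabet_size=28):
    """Average log-probability per bigram of s, with add-one smoothing.

    alphabet_size is an assumption: 26 letters plus the two padding markers.
    """
    padded = "^" + s + "$"
    pairs = list(zip(padded, padded[1:]))
    total = 0.0
    for a, b in pairs:
        p = (counts[(a, b)] + 1) / (context[a] + alphabet_size)
        total += math.log(p)
    return total / len(pairs)


COUNTS, CONTEXT = train_bigrams(WORDS)
```

Under these assumptions, a word-like string such as "then" scores higher than a string of rare letter pairs such as "xqzv". A score of this kind is one conceivable way to place challenge strings on a familiarity scale before choosing how strongly to degrade their images.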