The Effectiveness of Internet Content Filters

As part of its defense of the Child Online Protection Act, which seeks to prevent minors from viewing commercially published harmful-to-minors material on the World Wide Web, the U.S. Department of Justice commissioned a study of the prevalence of “adult” materials and the effectiveness of Internet content filters in blocking them. As of 2005–2006, about 1.1% of webpages indexed by Google and MSN were adult, which amounts to hundreds of millions of pages. About 6% of a set of 1.3 billion searches executed on AOL, MSN and Yahoo! in summer 2005 retrieved at least one adult webpage among the first ten results, and about 1.7% of those results were adult webpages. These estimates are based both on simple random samples of webpages indexed by search engines and on a stratified random sample of searches. Webpages with sexually explicit content intended for adult entertainment (i.e., not in an educational, medical or artistic context) were used to test a variety of Internet content filters for underblocking, that is, failing to block webpages that they are intended to block. A random sample of “clean” webpages with no sexual content or reference to sex was used to test the filters for overblocking, that is, blocking webpages they are not intended to block. Webpages retrieved by the most popular searches according to Wordtracker were also categorized and used to test filters. Generally, filters with lower rates of underblocking had higher rates of overblocking. If the filter most effective at blocking adult materials were applied to search indexes, typical query results, or the results of popular queries, the number of clean pages blocked in error would exceed the number of adult pages blocked correctly.

This work was supported by the United States Department of Justice. I testified on behalf of the United States at trial in Am. Civil Liberties Union v. Gonzales, 478 F. Supp. 2d 775 (E.D. Pa. 2007), and submitted expert declarations in the related matter of Gonzales v. Google, 234 F.R.D. 674, 688 (N.D. Cal. 2006). Much of the data collection was performed by a group led by Paul Mewett at CRA International. I am grateful to David Freedman, Raphael Gomez, Theodore Hirt, Joel McElvain and an anonymous referee for helpful conversations or comments on an earlier draft.
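
The abstract's final claim, that clean pages blocked in error would outnumber adult pages blocked correctly, comes down to base-rate arithmetic. The following minimal sketch illustrates it. Only the 1.1% adult share comes from the abstract. The underblocking and overblocking rates, the page count, and the function names are hypothetical and chosen only for illustration, and the sketch assumes for simplicity that every page is either adult or clean.

```python
def blocked_counts(n_pages, adult_share, underblock_rate, overblock_rate):
    """Return (adult pages blocked correctly, clean pages blocked in error)."""
    adult = n_pages * adult_share
    clean = n_pages - adult  # simplifying assumption: every non-adult page is clean
    adult_blocked = adult * (1 - underblock_rate)
    clean_blocked = clean * overblock_rate
    return adult_blocked, clean_blocked


def breakeven_overblock_rate(adult_share, underblock_rate):
    """Overblocking rate above which errors outnumber correct blocks."""
    return adult_share * (1 - underblock_rate) / (1 - adult_share)


ADULT_SHARE = 0.011  # from the abstract: about 1.1% of indexed pages
UNDERBLOCK = 0.05    # hypothetical rate, not a result from the study
OVERBLOCK = 0.02     # hypothetical rate, not a result from the study

adult_blocked, clean_blocked = blocked_counts(
    1_000_000, ADULT_SHARE, UNDERBLOCK, OVERBLOCK
)
threshold = breakeven_overblock_rate(ADULT_SHARE, UNDERBLOCK)

print(f"adult pages blocked correctly: {adult_blocked:,.0f}")
print(f"clean pages blocked in error:  {clean_blocked:,.0f}")
print(f"break-even overblocking rate:  {threshold:.4f}")
```

Because adult pages are so small a share of the total, the break-even overblocking rate is roughly equal to that share. Under these assumptions, any filter that blocks more than about 1% of clean pages blocks more clean pages than adult pages in absolute terms, however few adult pages it misses.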