ConceptDoppler: a weather tracker for internet censorship

The text of this paper has passed across many Internet routers on its way to the reader, but some routers will not pass it along unfettered because of censored words it contains. We present two sets of results: 1) Internet measurements of keyword filtering by the Great “Firewall” of China (GFC); and 2) initial results of using latent semantic analysis (LSA) as an efficient way to reproduce a blacklist of censored words via probing. Our Internet measurements suggest that the GFC’s keyword filtering is more a panopticon than a firewall, i.e., it need not block every illicit word, but only enough to promote self-censorship. China’s largest ISP, ChinaNET, performed 83.3% of all filtering of our probes, and 99.1% of all filtering that occurred at the first hop past the Chinese border. Filtering occurred beyond the third hop for 11.8% of our probes, and there were sometimes as many as 13 hops past the border to a filtering router. Approximately 28.3% of the Chinese hosts we sent probes to were reachable along paths that were not filtered at all. While more tests are needed to provide a definitive picture of the GFC’s implementation, our results disprove the notion that GFC keyword filtering is a firewall strictly at the border of China’s Internet. While evading a firewall a single time defeats its purpose, it would be necessary to evade a panopticon almost every time. Thus, in lieu of evasion, we propose ConceptDoppler, an architecture for maintaining a censorship “weather report” about what keywords are filtered over time. Probing with potentially filtered keywords is arduous due to the GFC’s complexity and can be invasive if not done efficiently. Just as an understanding of the mixing of gases preceded effective weather reporting, an understanding of the relationship between keywords and concepts is essential for tracking Internet censorship.
We show that LSA can effectively pare down a corpus of text and cluster filtered keywords for efficient probing, present 122 keywords we discovered by probing, and underscore the need for tracking and studying censorship blacklists by discovering some surprising blacklisted keywords such as 转化率 (conversion rate), 我的奋斗 (Mein Kampf), and 国际地质科学联合会 (International geological scientific federation (Beijing)).
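
To make the idea of using latent semantic analysis to prioritize probe candidates more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It builds a term-by-document count matrix from a toy corpus, projects terms into a rank-k latent space, and ranks the remaining terms by cosine similarity to a seed term, so that terms closest to a known concept could be probed first. The corpus, the English stand-in tokens, the function names, and the choice of k = 2 are all hypothetical. For portability the sketch approximates the truncated SVD with a small subspace iteration in pure Python; in practice a numerical library's SVD routine would be the usual choice.

```python
import math

def build_term_doc(docs):
    """Return (sorted vocabulary, term-by-document count matrix)."""
    vocab = sorted({t for d in docs for t in d})
    row = {t: i for i, t in enumerate(vocab)}
    A = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for t in d:
            A[row[t]][j] += 1.0
    return vocab, A

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors):
    """Gram-Schmidt orthonormalization, processing vectors in order."""
    basis = []
    for v in vectors:
        for b in basis:
            c = dot(v, b)
            v = [x - c * y for x, y in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([x / norm for x in v])
    return basis

def lsa_term_vectors(A, k, iters=100):
    """Rank-k term embeddings (U_k scaled by singular values).

    Uses subspace iteration on the term-term matrix A A^T as a
    stand-in for a library SVD (an assumption of this sketch).
    """
    n = len(A)
    G = [[dot(A[i], A[j]) for j in range(n)] for i in range(n)]
    # Deterministic, linearly independent starting vectors.
    Q = orthonormalize(
        [[float((i + 1) ** j) for i in range(n)] for j in range(k)]
    )
    for _ in range(iters):
        Q = orthonormalize([[dot(row, q) for row in G] for q in Q])
    # Rayleigh quotients approximate the squared singular values.
    sigma = [
        math.sqrt(max(dot(q, [dot(row, q) for row in G]), 0.0)) for q in Q
    ]
    return [[sigma[j] * Q[j][i] for j in range(k)] for i in range(n)]

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def rank_by_similarity(seed, vocab, vectors):
    """Order all other terms by cosine similarity to the seed term."""
    s = vectors[vocab.index(seed)]
    scored = [(cosine(s, v), t) for t, v in zip(vocab, vectors) if t != seed]
    return [t for _, t in sorted(scored, reverse=True)]

# Hypothetical toy corpus: two unrelated topics, three documents each.
DOCS = [
    ["protest", "square", "student"],
    ["protest", "student", "strike"],
    ["square", "protest", "strike"],
    ["weather", "rain", "forecast"],
    ["rain", "forecast", "wind"],
    ["weather", "wind", "rain"],
]

VOCAB, A = build_term_doc(DOCS)
VECTORS = lsa_term_vectors(A, k=2)
RANKED = rank_by_similarity("protest", VOCAB, VECTORS)
print(RANKED)
```

In this toy setting the terms that share documents with the seed rank ahead of the unrelated topic, which is the property a prober would rely on when choosing which candidate keywords to test first.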
