Seeking Nonsense, Looking for Trouble: Efficient Promotional-Infection Detection through Semantic Inconsistency Search

Promotional infection is an attack in which an adversary exploits a website's weakness to inject illicit advertising content. Detecting such an infection is challenging because it resembles legitimate advertising activity. An interesting observation from our research is that such an attack almost always creates a large semantic gap between the infected domain (e.g., a university site) and the content it promotes (e.g., selling cheap viagra). Exploiting this gap, we developed a semantics-based technique, called Semantic Inconsistency Search (SEISE), for efficient and accurate detection of promotional injections on sponsored top-level domains (sTLDs) with explicit semantic meanings. Our approach uses Natural Language Processing (NLP) to identify the bad terms (those related to illicit activities such as selling fake drugs) that are most irrelevant to an sTLD's semantics. These terms, which we call irrelevant bad terms (IBTs), are used to query search engines under the sTLD for suspicious domains. Through a semantic analysis of the results pages returned by the search engines, SEISE detects the truly infected sites and automatically collects new IBTs from the titles, URLs, and snippets of their search result items to find new infections. Running on 403 sTLDs with 30 initial seed IBTs, SEISE analyzed 100K fully qualified domain names (FQDNs) and, along the way, automatically gathered nearly 600 IBTs. In the end, our approach detected 11K infected FQDNs with a false detection rate of 1.5% and over 90% coverage. Our study shows that effective detection of infected sTLDs can substantially raise the bar for promotional infections, since other vulnerable non-sTLD domains typically have much lower Alexa ranks and are therefore much less attractive for underground advertising.
Our findings further bring to light the striking impact of such promotional attacks, which compromise FQDNs under 3% of .edu and .gov domains and over one thousand gov.cn domains, including those of leading universities such as stanford.edu, mit.edu, princeton.edu, and harvard.edu and government institutions such as nsf.gov and nih.gov. We further demonstrate the potential to extend our current technique to protect generic domains such as .com and .org.
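The search-and-expand loop described in the abstract can be illustrated with a minimal Python sketch. It is not the SEISE implementation. Everything named here is an assumption made for illustration: the bag-of-words cosine similarity stands in for the paper's NLP-based semantic analysis, the 0.2 threshold is arbitrary, `bad_vocab` is a hypothetical fixed list of candidate terms (SEISE extracts new IBTs from result items automatically), and `search` is a hypothetical callable that returns (domain, snippet) pairs for a query string.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_query(stld, term):
    """Restrict a search for one IBT to one sTLD, e.g. site:.edu "cheap viagra"."""
    return f'site:{stld} "{term}"'


def is_inconsistent(snippet, stld_profile, threshold=0.2):
    """Assumed test: a snippet is suspicious if it shares little vocabulary
    with the sTLD's reference profile. The threshold is illustrative."""
    return cosine(Counter(tokenize(snippet)), stld_profile) < threshold


def expand(stld, seed_terms, stld_profile, bad_vocab, search, max_rounds=3):
    """Query with known IBTs, flag inconsistent results, and harvest new
    terms from flagged snippets to drive the next round of queries."""
    known = set(seed_terms)
    frontier = list(seed_terms)
    flagged = set()
    for _ in range(max_rounds):
        new_terms = []
        for term in frontier:
            for domain, snippet in search(build_query(stld, term)):
                if not is_inconsistent(snippet, stld_profile):
                    continue  # content is consistent with the sTLD's semantics
                flagged.add(domain)
                # Simplification: match against a fixed candidate list.
                for cand in bad_vocab:
                    if cand in snippet.lower() and cand not in known:
                        known.add(cand)
                        new_terms.append(cand)
        if not new_terms:
            break  # no new IBTs were found, so stop expanding
        frontier = new_terms
    return flagged, known


# Usage with a canned, hypothetical search index instead of a real search engine.
edu_profile = Counter(tokenize(
    "university students faculty research courses admissions campus library"))
index = {
    'site:.edu "cheap viagra"': [
        ("pharm.example.edu", "buy cheap viagra online casino bonus no prescription"),
        ("med.example.edu", "university research on viagra for faculty and students courses"),
    ],
    'site:.edu "casino bonus"': [
        ("blog.example.edu", "best casino bonus free spins"),
    ],
}
flagged, terms = expand(".edu", ["cheap viagra"], edu_profile,
                        ["casino bonus", "payday loan"],
                        lambda q: index.get(q, []))
print(sorted(flagged), sorted(terms))
```

In this toy run the second result for the seed term is not flagged, because its snippet shares most of its vocabulary with the .edu profile. This mirrors the abstract's point that a bad term alone is not evidence of infection; the semantic gap with the hosting domain is.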
