Ads-portal domains: Identification and measurements

An ads-portal domain is a Web domain that shows only advertisements, served by a third-party advertisement syndication service, in the form of an ad listing. We develop a machine-learning-based classifier that identifies ads-portal domains with 96% accuracy. We use this classifier to measure the prevalence of ads-portal domains on the Internet. Surprisingly, 28.3% of two-level *.com web domains and 25% of two-level *.net web domains are ads-portal domains. Moreover, 41% of the *.com and 39.8% of the *.net ads-portal domains are typos of well-known domains, also known as typo-squatting domains. In addition, we use the classifier along with DNS trace files to estimate how often Internet users visit ads-portal domains. It turns out that ∼5% of the two-level *.com, *.net, *.org, *.biz and *.info web domains in the traces are ads-portal domains, and ∼50% of these accessed ads-portal domains are typos. These numbers show that ads-portal domains, and typo-squatting ads-portal domains in particular, are prevalent on the Internet and successful in attracting many visits. Our classifier is a step towards better categorization of web documents. It can also help search engines' ranking algorithms, help identify web spam that redirects to ads-portal domains, and be used to discourage access to typo-squatting ads-portal domains.
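To make the notion of a typo of a well-known domain concrete, the sketch below flags a two-level domain as a typo-squatting candidate when its name is within a small edit distance of a well-known domain under the same top-level domain. This is only an illustration: the abstract does not state how typos are detected, so the Levenshtein criterion, the threshold of 1, the label-counting shortcut in `two_level_name`, and the function names are all assumptions rather than the authors' method.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def two_level_name(host: str) -> str:
    """Keep the last two labels, e.g. 'www.example.com' -> 'example.com'.

    Assumption: the TLD is a single label (com, net, org, biz, info).
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def typo_target(domain: str,
                well_known: Iterable[str],
                max_distance: int = 1) -> Optional[str]:
    """Return the well-known domain that `domain` looks like a typo of, if any.

    Hypothetical criterion: same TLD, different name, edit distance of the
    name at most `max_distance`.
    """
    label, _, tld = two_level_name(domain).partition(".")
    for target in well_known:
        t_label, _, t_tld = two_level_name(target).partition(".")
        if (tld == t_tld and label != t_label
                and edit_distance(label, t_label) <= max_distance):
            return target
    return None


if __name__ == "__main__":
    popular = ["google.com", "yahoo.com"]
    for candidate in ["www.gogle.com", "google.com", "gogle.net"]:
        print(candidate, "->", typo_target(candidate, popular))
```

A check of this kind only marks a domain as a typo candidate; whether the candidate is also an ads-portal domain would still have to be decided by a classifier such as the one described above.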
