Academic Search Engine Spam and Google Scholar's Resilience Against It

In a previous paper, we provided guidelines for scholars on optimizing research articles for academic search engines such as Google Scholar. Feedback from the academic community on these guidelines was diverse. Some were concerned that researchers could use our guidelines to manipulate the rankings of scientific articles and promote what we call ‘academic search engine spam’. To find out whether these concerns are justified, we conducted several tests on Google Scholar. The results show that academic search engine spam is indeed possible, and with little effort: we increased the rankings of academic articles on Google Scholar by manipulating their citation counts; Google Scholar indexed invisible text we added to some articles, so that these articles appeared in searches for keywords they were not relevant to; Google Scholar indexed some nonsensical articles we randomly created with the paper generator SciGen; and Google Scholar linked to manipulated versions of research papers that contained a Viagra advertisement. At the end of this paper, we discuss whether academic search engine spam could become a serious threat to Web-based academic search engines.
