Cross-lingual web spam classification

While labeled Web spam training data exists for English, filtering a Web domain in another language requires an expensive human labeling effort. In this paper we survey how existing content- and link-based classification techniques work, how models can be "translated" from English into another language, and how language-dependent and language-independent methods can be combined. In particular, we show that simple bag-of-words translation works very well, and that this procedure can also exploit mixed-language Web hosts, i.e. those that contain an English translation of part of the local-language text. Our experiments use the ClueWeb09 corpus as the English training collection and a large Portuguese crawl from the Portuguese Web Archive. To foster further research, we provide labels and precomputed term frequencies, content-based features, and link-based features for both ClueWeb09 and the Portuguese data.
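
To make the idea of bag-of-words translation concrete, the following is a minimal sketch of one possible reading of it, not the procedure used in the paper. It maps the term frequencies of a Portuguese host into English terms through a bilingual dictionary and then scores the result with a linear model trained on English data. The dictionary `PT_EN`, the weights `EN_WEIGHTS`, and the function names `translate_bow` and `spam_score` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy bilingual dictionary (Portuguese -> English); illustrative only.
PT_EN = {
    "barato": ["cheap"],
    "comprar": ["buy"],
    "notícias": ["news"],
    "tempo": ["time", "weather"],
}

# Hypothetical weights of a linear spam model trained on English term frequencies.
EN_WEIGHTS = {"cheap": 1.2, "buy": 0.8, "news": -1.0, "weather": -0.5, "time": -0.1}


def translate_bow(pt_counts, dictionary):
    """Map a Portuguese term-frequency bag into English terms.

    A term with several translations splits its count evenly among them;
    terms missing from the dictionary are dropped.
    """
    en_counts = Counter()
    for term, count in pt_counts.items():
        translations = dictionary.get(term, [])
        for en_term in translations:
            en_counts[en_term] += count / len(translations)
    return en_counts


def spam_score(en_counts, weights):
    """Linear score over length-normalized term frequencies (higher = more spam-like)."""
    total = sum(en_counts.values())
    if total == 0:
        return 0.0
    return sum(weights.get(t, 0.0) * c / total for t, c in en_counts.items())


# Usage: score a Portuguese host with the English-trained weights.
host_terms = Counter({"barato": 2, "comprar": 2})
score = spam_score(translate_bow(host_terms, PT_EN), EN_WEIGHTS)
```

Splitting the count of an ambiguous term evenly across its translations is only one simple choice made for this sketch; other weighting schemes are equally plausible.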
