Soft-404 Pages, A Crawling Problem

While traversing the Web, crawlers face multiple challenges. Some of these concern detecting garbage content so that no resources are wasted processing it. Soft-404 pages are a type of garbage content that arises when a web server does not return the appropriate HTTP response code for dead links, so the resulting pages are incorrectly identified as valid documents. Our analysis of the Web has revealed that 7.35% of web servers send a 200 HTTP code when they receive a request for an unknown document, instead of a 404 code, which indicates that the document was not found. This paper presents a system called Soft404Detector, which uses web content analysis to identify Soft-404 pages. Our system applies a set of content-based heuristics and combines them with a C4.5 classifier. For testing purposes, we built a dataset of Soft-404 pages. Our experiments indicate that the system is very effective, achieving a precision of 0.992 and a recall of 0.980 on Soft-404 pages.
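To make the phenomenon concrete, the following minimal sketch shows one way to check whether a server answers 200 for a document that almost certainly does not exist. It is not Soft404Detector, which classifies pages by their content, and it is not necessarily how the 7.35% figure was measured. The function names `random_probe_path`, `answers_200_for_unknown`, and `http_status` are hypothetical and introduced only for illustration.

```python
import uuid
import urllib.error
import urllib.request
from typing import Callable


def random_probe_path() -> str:
    """Build a path that is extremely unlikely to exist on any server."""
    return "/" + uuid.uuid4().hex + ".html"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status code for url (redirects are followed).

    Connection-level failures (urllib.error.URLError) are not handled here
    and propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is on the exception.
        return err.code


def answers_200_for_unknown(
    base_url: str,
    fetch_status: Callable[[str], int] = http_status,
) -> bool:
    """Request a random, presumably nonexistent document under base_url.

    Returns True when the server replies 200, i.e. it shows the
    soft-404 behaviour described above (assumption: a single probe is
    taken as representative of the server).
    """
    url = base_url.rstrip("/") + random_probe_path()
    return fetch_status(url) == 200
```

The status fetcher is passed in as a parameter so the check can be exercised without network access, for example `answers_200_for_unknown("http://example.test", lambda url: 404)`. A probe of this kind only flags servers; deciding whether an individual crawled page is a Soft-404 page still requires analysis of the page itself.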
