Analysis and detection of Soft-404 pages

The WWW is continuously growing, but not always in the best way, due to the proliferation of garbage content such as Web Spam pages, duplicate content, and dead links. Some web servers do not use the appropriate HTTP response code for dead links, causing those links to be incorrectly identified and creating a problem for search engines. Our analysis revealed that 7.35% of web servers send a 200 HTTP code when a request for an unknown document is received, instead of the 404 code that indicates the document was not found. Such pages are known as Soft-404 pages. Soft-404 pages are a problem for search engines and their crawling modules, which process and index these pages, with the consequent waste of resources. Few studies have analysed this problem and tried to solve it. In this article we propose a new detection system for Soft-404 pages, called Soft404Detector, which uses a set of content-based heuristics and combines them with a C4.5 classifier. For testing purposes, we built a dataset of Soft-404 pages. Our experiments indicate that our system is very effective, achieving a precision of 0.992 and a recall of 0.980 on Soft-404 pages.
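To make the measured behaviour concrete, the sketch below probes a server with a request for a document that almost certainly does not exist and inspects the status code it returns; a server answering 200 instead of 404 exhibits the soft-404 behaviour described above. This is a minimal illustration in Python using the requests library, not the paper's Soft404Detector (which classifies page content with heuristics and C4.5); the helper name and probe path are assumptions introduced here.

```python
import uuid
import requests

def exhibits_soft404(base_url: str, timeout: float = 10.0) -> bool:
    """Return True if base_url answers a random, nonexistent path with 200.

    Note: hypothetical helper, not part of the original system. A correctly
    configured server would reply 404 (or 410) to this probe.
    """
    # Build a path that no real page should have.
    probe_path = "/" + uuid.uuid4().hex + ".html"
    response = requests.get(
        base_url.rstrip("/") + probe_path,
        timeout=timeout,
        allow_redirects=True,  # follow redirects, since many soft-404s redirect first
    )
    return response.status_code == 200

if __name__ == "__main__":
    print(exhibits_soft404("http://example.com"))
```

A status-code probe like this only flags servers that misreport at the HTTP level; detecting which of those 200 responses are actually error pages is where the content-based heuristics and classifier come in.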
