Soft-404 Pages, A Crawling Problem
[1] J. Ross Quinlan,et al. C4.5: Programs for Machine Learning , 1992 .
[2] Sung-Ryul Kim,et al. Detecting soft errors by redirection classification , 2009, WWW '09.
[3] Susan T. Dumais,et al. A Bayesian Approach to Filtering Junk E-Mail , 1998, AAAI 1998.
[4] William C. Schmidt,et al. World-Wide Web survey research: Benefits, potential problems, and solutions , 1997 .
[5] Marc Najork,et al. Detecting spam web pages through content analysis , 2006, WWW '06.
[6] Idit Keidar,et al. Do not crawl in the DUST: different URLs with similar text , 2007, WWW '07.
[7] Frank M. Shipman,et al. Identifying "Soft 404" Error Pages: Analyzing the Lexical Signatures of Documents in Distributed Collections , 2012, TPDL.
[8] Hector Garcia-Molina,et al. Web Spam Taxonomy , 2005, AIRWeb.
[9] Ricardo A. Baeza-Yates,et al. Characterization of national Web domains , 2007, TOIT.
[10] Setsuo Ohsuga,et al. International Conference on Very Large Data Bases , 1977 .
[11] B. Huberman,et al. The Deep Web: Surfacing Hidden Value , 2000 .
[12] J. Ross Quinlan,et al. Improved Use of Continuous Attributes in C4.5 , 1996, J. Artif. Intell. Res..
[13] Sriram Raghavan,et al. Crawling the Hidden Web , 2001, VLDB.
[14] Marc Najork,et al. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages , 2004, WebDB '04.
[15] Ricardo Baeza-Yates,et al. Modern Information Retrieval - the concepts and technology behind search, Second edition , 2011 .
[16] Ian H. Witten,et al. The WEKA data mining software: an update , 2009, SIGKDD Explorations.
[17] Kumar Chellapilla,et al. A taxonomy of JavaScript redirection spam , 2007, AIRWeb '07.
[18] Marc Najork,et al. Detecting phrase-level duplication on the world wide web , 2005, SIGIR '05.
[19] Marc Najork,et al. Web Crawling , 2010, Found. Trends Inf. Retr..
[20] Hector Garcia-Molina,et al. Link Spam Alliances , 2005, VLDB.
[21] Idit Keidar,et al. Do not crawl in the DUST: different URLs with similar text , 2006, WWW.
[22] Chabane Djeraba,et al. High performance crawling system , 2004, MIR '04.
[23] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.
[24] Kentaro Inui,et al. Development of a large-scale web crawler and search engine infrastructure , 2009, IUCS '09.
[25] Ron Kohavi,et al. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.
[26] José María Gómez Hidalgo,et al. Evaluating cost-sensitive Unsolicited Bulk Email categorization , 2002, SAC '02.
[27] Rajeev Motwani,et al. Stratified Planning , 2009, IJCAI.
[28] Brian D. Davison,et al. Adversarial Web Search , 2011, Found. Trends Inf. Retr..
[29] David Carmel,et al. The connectivity sonar: detecting site functionality by structural patterns , 2003, HYPERTEXT '03.
[30] Ramesh Govindan,et al. Making Eigenvector-Based Reputation Systems Robust to Collusion , 2004, WAW.
[31] Andrei Z. Broder,et al. Sic transit gloria telae: towards an understanding of the web's decay , 2004, WWW '04.
[32] Geoffrey Zweig,et al. Syntactic Clustering of the Web , 1997, Comput. Networks.
[33] Ricardo Baeza-Yates,et al. Characteristics of the Web of Spain , 2005 .
[34] Donghua Pan,et al. Web Page Content Extraction Method Based on Link Density and Statistic , 2008, 2008 4th International Conference on Wireless Communications, Networking and Mobile Computing.
[35] Berkant Barla Cambazoglu,et al. Scalability Challenges in Web Search Engines , 2015, Advanced Topics in Information Retrieval.
[36] Antonio Gulli,et al. The indexable web is more than 11.5 billion pages , 2005, WWW '05.
[37] Christophe Bisciglia,et al. Cluster computing for web-scale data processing , 2008, SIGCSE '08.
[38] J. Ross Quinlan,et al. Bagging, Boosting, and C4.5 , 1996, AAAI/IAAI, Vol. 1.
[39] Brian D. Davison,et al. Cloaking and Redirection: A Preliminary Study , 2005, AIRWeb.
[40] Brian D. Davison,et al. Identifying link farm spam pages , 2005, WWW '05.
[41] András A. Benczúr,et al. SpamRank -- Fully Automatic Link Spam Detection , 2005, AIRWeb.