A Novel Combine Forecasting Method for Predicting News Update Time

With the rapid development of the Internet, the amount of information available online has grown explosively. Faced with massive and constantly updated information, how users can quickly access more, and more valuable, information has become a hot research topic. Web page update times appear erratic, which makes forecasting the update time of news reports especially difficult. From an application standpoint, mathematical models can approximate this variation closely, even though they cannot be completely accurate. The same holds for predicting news update times, which helps improve a news crawler's scheduling policy. In this paper, we propose a combined prediction algorithm for news updates. To predict the update time of news, we first apply the Exponential Smoothing method to our dataset and select its optimal parameters. Second, we use a Naive Bayes model for prediction. Finally, we combine the two methods into a Combination Forecasting method and compare it with the former methods. Experiments on Sohu News show that the Combination Forecasting method outperforms the other methods when estimating the localized rate of updates.
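
The pipeline described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: simple (single) exponential smoothing over the intervals between observed updates, a grid search over the smoothing factor that minimizes squared one-step-ahead error, and a fixed-weight average as the combination step. The function names, the grid, and the weight `w` are hypothetical, and the second forecast is a placeholder number standing in for the output of a Naive Bayes model, which is not implemented here.

```python
from typing import Optional, Sequence


def ses_forecast(intervals: Sequence[float], alpha: float) -> float:
    """One-step-ahead forecast using simple exponential smoothing.

    `intervals` are the observed gaps between consecutive updates
    (assumed representation; the abstract does not specify one).
    """
    if not intervals:
        raise ValueError("need at least one observed interval")
    level = intervals[0]  # initialize the level with the first observation
    for x in intervals[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


def ses_sse(intervals: Sequence[float], alpha: float) -> float:
    """Sum of squared one-step-ahead errors for a given alpha."""
    if not intervals:
        raise ValueError("need at least one observed interval")
    level = intervals[0]
    sse = 0.0
    for x in intervals[1:]:
        sse += (x - level) ** 2  # error of the forecast made before seeing x
        level = alpha * x + (1 - alpha) * level
    return sse


def best_alpha(intervals: Sequence[float],
               grid: Optional[Sequence[float]] = None) -> float:
    """Pick the alpha on a grid that minimizes in-sample squared error."""
    if grid is None:
        grid = [i / 10 for i in range(1, 10)]  # 0.1 .. 0.9 (hypothetical grid)
    return min(grid, key=lambda a: ses_sse(intervals, a))


def combine(f1: float, f2: float, w: float) -> float:
    """Weighted average of two forecasts; w is the weight on f1."""
    return w * f1 + (1 - w) * f2


if __name__ == "__main__":
    gaps = [30.0, 45.0, 40.0, 35.0, 50.0]  # made-up update gaps in minutes
    alpha = best_alpha(gaps)
    smoothing_forecast = ses_forecast(gaps, alpha)
    bayes_forecast = 42.0  # placeholder for a Naive Bayes prediction
    print(combine(smoothing_forecast, bayes_forecast, w=0.5))
```

A fixed weight keeps the sketch short; in practice the weight would more plausibly be chosen from the relative errors of the two component forecasts on held-out data.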
