ChainMR Crawler: A Distributed Vertical Crawler Based on MapReduce

With the explosive growth of data on the Internet, a single-node vertical crawler can no longer meet the performance requirements placed on crawlers. Existing distributed vertical crawlers, in turn, offer only limited customization. To address these problems, this paper proposes a distributed vertical crawler named ChainMR Crawler. We adopt the ChainMapper/ChainReducer model to design each module of the crawler, use Redis to manage URLs, and choose the distributed database HBase to store the key content of web pages. Experimental results demonstrate that the efficiency of ChainMR Crawler is 6% higher than that of Nutch in vertical crawling, which meets the expected goal.
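The chained-stage idea can be illustrated with a minimal single-process sketch. It is not the authors' implementation and does not use Hadoop, Redis, or HBase: `chain` only imitates how ChainMapper passes each stage's output records to the next stage, `UrlFrontier` is an in-memory stand-in for a Redis-backed URL queue with deduplication, and a plain dictionary stands in for an HBase table. All names here (`chain`, `UrlFrontier`, `FAKE_WEB`, `fetch`, `crawl`) and the page format are assumptions made for illustration.

```python
from collections import deque

# Hypothetical "web": URL -> "title|comma-separated outgoing links".
FAKE_WEB = {
    "http://example.com/a": "Title A|http://example.com/b,http://example.com/a",
    "http://example.com/b": "Title B|",
}


def chain(*stages):
    """Compose stages so every record emitted by one stage feeds the next."""
    def run(key, value):
        records = [(key, value)]
        for stage in stages:
            records = [out for k, v in records for out in stage(k, v)]
        return records
    return run


class UrlFrontier:
    """In-memory stand-in for a Redis queue plus a set of seen URLs."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, url):
        """Enqueue url unless it was seen before; return True if enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next URL, or None when the frontier is empty."""
        return self._queue.popleft() if self._queue else None


def fetch(url, _value):
    """Stand-in for an HTTP download: look the page up in FAKE_WEB."""
    raw = FAKE_WEB.get(url)
    return [(url, raw)] if raw is not None else []


def crawl(seeds):
    frontier = UrlFrontier()
    for seed in seeds:
        frontier.push(seed)
    store = {}  # stand-in for an HBase table keyed by URL

    def parse(url, raw):
        # Split the page into its title and links; queue links for later.
        title, _, links = raw.partition("|")
        for link in filter(None, links.split(",")):
            frontier.push(link)
        yield url, title

    def keep_nonempty(url, title):
        # Vertical-crawler style filter: drop pages without key content.
        if title:
            yield url, title

    pipeline = chain(fetch, parse, keep_nonempty)
    while True:
        url = frontier.pop()
        if url is None:
            break
        for key, value in pipeline(url, ""):
            store[key] = value
    return store
```

Because each stage has the same `(key, value) -> records` shape, a site-specific parser or filter can be swapped in by changing the arguments to `chain`, which is the kind of customization the chained model is meant to make easy.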
