Paper Information - The Research of Web Parallel Information Extraction Based on Hadoop

Big data, driven by three major trends (cloud computing, social computing, and mobile computing), is reshaping business processes, IT infrastructure, and the way we capture and use enterprise, customer, and Internet information. To extract big data from the Internet, an enterprise needs a scalable, flexible, and manageable data infrastructure. This paper therefore analyzes and designs a large-scale information extraction system based on the Hadoop framework. Measurements show that extracting huge amounts of data on a cluster greatly improves performance compared with single-machine extraction, while offering high reliability and scalability. Moreover, this research provides better technical solutions for Web information extraction and for sensitive information.

Quan Shi | Lu Xu | Ma Songyu

[1] Wilson C. Hsieh, et al. Bigtable: A Distributed Storage System for Structured Data, 2006, TOCS.

[2] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[3] Yanpei Chen, et al. Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads, 2012, Proc. VLDB Endow.

[4] Barbara Di Eugenio, et al. A Lucene and Maximum Entropy Model Based Hedge Detection System, 2010, CoNLL Shared Task.