A fast big data collection system using the MapReduce framework

Social networks, as corpora of valuable data, have attracted much attention from researchers in various fields in recent years, especially in big data analytics. However, efficient and accurate data collection, the foundation of such analytics, has received little attention in previously published work. As the volume of data on the web grows rapidly, this article identifies two major challenges that traditional distributed web crawler systems cannot meet: handling the big data in social networks quickly, and accommodating multiple web sources with a unified collection model. To address these two challenges, and thereby build a foundation for big data analytics, this article proposes an ontology-based adaptive web crawler system called the OACM system. It uses the MapReduce model to balance processing resources effectively and thus speed up the collection procedure, and it designs a unified ontology model that estimates the semantic content of both social networks and collection tasks in order to adapt to different web sources. In a set of experiments, the proposed OACM system scheduled system resources efficiently and accomplished the task of collecting large amounts of data from multiple web sources.
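
To make the MapReduce idea concrete, the following is a minimal in-memory sketch of how collection tasks might be split into a map phase and a reduce phase. It is not the OACM implementation: the function names (`map_task`, `shuffle`, `reduce_task`, `run_job`), the example URLs, and the choice of the source host as the grouping key are all assumptions made for illustration, and no pages are actually fetched.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_task(url):
    """Map step (hypothetical): emit a (source host, url) pair for one task."""
    host = urlparse(url).netloc
    yield host, url


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce runtime does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(host, urls):
    """Reduce step (hypothetical): summarise the tasks gathered for one source."""
    return host, {"count": len(urls), "urls": sorted(urls)}


def run_job(urls):
    """Run map, shuffle, and reduce sequentially over a list of task URLs."""
    pairs = (pair for url in urls for pair in map_task(url))
    return dict(reduce_task(host, group) for host, group in shuffle(pairs).items())


if __name__ == "__main__":
    # Placeholder URLs; a real job would read tasks from a queue or file.
    seeds = [
        "http://a.example/user/1",
        "http://b.example/post/7",
        "http://a.example/user/2",
    ]
    for host, summary in run_job(seeds).items():
        print(host, summary["count"])
```

In a real deployment the map and reduce functions would run on separate workers under a MapReduce runtime, which is where the resource balancing described above would take place; this sketch only shows the data flow between the two phases.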