Map reduce programming model: Construction of inverted index for automated document clustering

The inverted index is a core data structure in information retrieval: it allows retrieval engines to support full-text search efficiently. In this paper, the MapReduce algorithm is used to construct the inverted index, so that construction runs in parallel and the data structure can support large-scale document corpora. The corpus consists of crime-related news articles concerning women and children, drawn from English-language newspapers across India. The paper aims to cluster these news articles, in a parallelized manner, according to the specific type of crime committed. We propose a Hadoop-based framework integrated with the R environment that preprocesses the corpus, stores the news articles, and processes them with the MapReduce algorithm, which identifies the type of crime (such as sexual harassment, physical abuse, emotional abuse, rape, or murder) and clusters the articles accordingly. We observe that the proposed method outperforms the conventional methods and is better suited to batch-processing applications.
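As an illustration of the idea in the abstract, the following is a minimal in-memory sketch of inverted-index construction in the map/shuffle/reduce style, followed by a simple keyword lookup that groups documents by crime type. It is not the authors' implementation: the paper describes a Hadoop framework integrated with R, whereas this sketch uses plain Python with no cluster. The function names, the sample corpus, the `CRIME_KEYWORDS` table, and the keyword-matching clustering rule are all assumptions made for the example.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    terms = set(re.findall(r"[a-z]+", text.lower()))
    return [(term, doc_id) for term in terms]


def shuffle(pairs):
    """Shuffle step: group the emitted doc ids by term, as a framework would."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step: turn the grouped doc ids into a sorted postings list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(corpus):
    """Run map, shuffle, and reduce sequentially over a {doc_id: text} corpus."""
    pairs = []
    for doc_id, text in corpus.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


def cluster_by_crime(index, crime_keywords):
    """Assumed clustering rule: a document joins a cluster if it contains
    any keyword listed for that crime type."""
    clusters = {}
    for crime, keywords in crime_keywords.items():
        docs = set()
        for keyword in keywords:
            docs.update(index.get(keyword, []))
        clusters[crime] = sorted(docs)
    return clusters


# Hypothetical sample data, not taken from the paper's corpus.
corpus = {
    "d1": "Police report a murder in the city",
    "d2": "Court hears harassment complaint",
    "d3": "Murder suspect arrested after harassment case",
}
CRIME_KEYWORDS = {
    "murder": ["murder"],
    "sexual harassment": ["harassment"],
}

index = build_inverted_index(corpus)
clusters = cluster_by_crime(index, CRIME_KEYWORDS)
print(clusters)
```

In a real Hadoop job the map and reduce functions would run on separate nodes and the shuffle would be handled by the framework; here all three steps run sequentially in one process so the data flow is easy to follow.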

V. Bhuvaneswari | K. Santhiya
