Map reduce programming model: Construction of inverted index for automated document clustering

An inverted index is an important data structure in information retrieval: it enables retrieval engines to support full-text search efficiently. In this paper, the MapReduce programming model is used to construct the inverted index, so that construction runs in parallel and the data structure scales to large document corpora. We consider news articles on crimes related to women and children, drawn from English-language newspapers across India. The paper aims to cluster these articles, in a parallelized manner, according to the specific type of crime committed. We propose a Hadoop-based framework, integrated with the R environment, that preprocesses the corpus, stores the news articles, and processes them with a MapReduce algorithm that identifies the type of crime (such as sexual harassment, physical abuse, emotional abuse, rape, or murder) and clusters the articles accordingly. We observe that the proposed method outperforms conventional methods and is better suited to batch-processing applications.
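As an illustration of the idea only, the following is a minimal single-process sketch in Python of how an inverted index can be built with map, shuffle, and reduce steps and then used to group articles by crime type. It is not the Hadoop and R framework described above. The names `CRIME_TERMS`, `map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`, and `cluster_by_crime`, the keyword lists, and the three sample articles are all assumptions made for this example; simple keyword matching stands in for whatever crime-identification rule the framework actually applies.

```python
import re
from collections import defaultdict

# Hypothetical keyword lists per crime type (an assumption for this sketch).
CRIME_TERMS = {
    "murder": {"murder", "murdered", "killed"},
    "physical abuse": {"beaten", "assault", "assaulted"},
}


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct token in a document."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    for token in tokens:
        yield token, doc_id


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a MapReduce runtime would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step: collapse the document ids for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(corpus):
    """Run map, shuffle, and reduce over a {doc_id: text} corpus."""
    pairs = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_phase(doc_id, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


def cluster_by_crime(index, crime_terms):
    """Group documents by crime type using the posting lists of each keyword."""
    clusters = {}
    for crime, terms in crime_terms.items():
        docs = set()
        for term in terms:
            docs.update(index.get(term, []))
        clusters[crime] = sorted(docs)
    return clusters


# Hypothetical sample articles, invented for this example.
corpus = {
    "d1": "Woman murdered in Delhi",
    "d2": "Child beaten by guardian",
    "d3": "Man killed wife, police say",
}
index = build_inverted_index(corpus)
clusters = cluster_by_crime(index, CRIME_TERMS)
```

In a real Hadoop job the map and reduce functions would run on separate nodes and the framework would perform the shuffle. Here all three steps run in one process so that the data flow is easy to follow.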