Detecting Associations in Large Dataset on MapReduce

In daily life, we are surrounded by all kinds of data. Finding the relationships among these data has become one of the major challenges facing data scientists. In 2011, David N. Reshef et al. took a great leap toward solving this problem: they showed that the maximal information coefficient (MIC) is an effective tool for detecting different kinds of relationships between any given pair of variables, whether or not these relationships are functional. However, challenges remain, because the computation is too complex and time-consuming for large datasets, which makes the algorithm impractical for real-world use. In this paper, we explore possible ways to parallelize the detection of associations between variables in large datasets and propose a high-performance MapReduce-based solution, which includes a data storage pattern, preprocessing algorithms, a distributed memory cache mechanism, and a series of MapReduce jobs. The experiments show that our parallel solution provides a linear speedup compared with the original algorithm without affecting correctness. This work makes the well-known MIC algorithm more practical for solving real problems.