Estimating Median in the Multi-sourced Heterogeneous Data Set: A Distributed Implementation

The continuous operation of enterprise applications produces a huge volume of business data that resides in different storage and system environments and is owned by different companies or organizations, forming a typical distributed, multi-sourced heterogeneous data set. Such data sets hold great potential value for official statistics. Although the median is a commonly used indicator in official statistics, estimating it in the distributed computing environment of a multi-sourced heterogeneous data set is not trivial, because the median, unlike a sum or a mean, cannot in general be obtained by combining values computed separately at each source. In this paper, we propose a distributed method for estimating the median of a multi-sourced heterogeneous data set. Taking into account the differing sizes of the source data sets and the uneven distribution of their data values, we first improve the traditional interpolation-based median estimation method for such data. We then propose a distributed implementation of the improved method based on web service technology. Finally, we evaluate the accuracy and performance of the proposed method through an experimental study.

CCS Concepts
Information systems ➝ Data management systems ➝ Information integration ➝ Mediators and data integration