Double dip map-reduce for processing cross validation jobs

Cross validation is fundamental to machine learning because it provides a reliable way to evaluate algorithms and the overall quality of the corpora in use. In typical cross validation, the corpus is first divided into training and validation segments, which are then crossed over in successive rounds so that each data segment is validated against the remaining ones. This process is prohibitively expensive in time and effort, and is often set aside in favor of computationally cheaper alternatives, such as heuristics. In this paper we introduce a cloud-based architecture for running cross validation jobs. Our solution makes heavy use of computational resources in the cloud through a strategy with two distinct, consecutive map-reduce cycles: the first performs the target algorithmic computation, and the second provides cross validation data that is fed back into the machine learning process. We demonstrate the feasibility of the proposed approach with an implementation of a web segmentation algorithm.
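
As a rough illustration of the two-cycle idea, the following single-process Python sketch chains two map-reduce passes over a k-fold split. It is not the architecture proposed in this paper: the function names, the interleaved fold layout, and the toy majority-label "algorithm" standing in for the target computation are all assumptions made for the example, and the built-in map and functools.reduce stand in for a distributed map-reduce runtime.

```python
from collections import Counter
from functools import reduce


def make_folds(corpus, k):
    """Split the corpus into k interleaved folds (an assumed layout)."""
    return [corpus[i::k] for i in range(k)]


def train_majority(examples):
    """Toy stand-in for the target algorithm: learn the most common label."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def cycle_one_map(job):
    """Cycle 1, map: train on all other folds, predict on the held-out fold."""
    fold_id, folds = job
    training = [ex for i, fold in enumerate(folds) if i != fold_id for ex in fold]
    predicted = train_majority(training)
    return {fold_id: [(predicted, label) for _, label in folds[fold_id]]}


def cycle_one_reduce(left, right):
    """Cycle 1, reduce: merge per-fold outputs into one dictionary."""
    return {**left, **right}


def cycle_two_map(pairs):
    """Cycle 2, map: count correct predictions for one fold."""
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return correct, len(pairs)


def cycle_two_reduce(left, right):
    """Cycle 2, reduce: add up (correct, total) counts across folds."""
    return left[0] + right[0], left[1] + right[1]


def cross_validate(corpus, k):
    """Run both cycles in sequence and return the overall accuracy."""
    folds = make_folds(corpus, k)
    jobs = [(i, folds) for i in range(k)]
    outputs = reduce(cycle_one_reduce, map(cycle_one_map, jobs))
    correct, total = reduce(cycle_two_reduce, map(cycle_two_map, outputs.values()))
    return correct / total


if __name__ == "__main__":
    # Hypothetical corpus of (item, label) pairs.
    corpus = [(i, "a") for i in range(8)] + [(i, "b") for i in range(2)]
    print(cross_validate(corpus, k=5))
```

The second cycle consumes only the output of the first, so in this sketch the validation statistics could be computed on separate workers once the target computation has finished; how the resulting score is fed back into learning is left out here.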