A Comparative Live Evaluation of Multileaving Methods on a Commercial cQA Search

We present one of the first attempts to examine the feasibility of multileaving evaluation of document rankings on a large-scale commercial community Question Answering (cQA) service. As a natural extension of interleaving evaluation, multileaving merges more than two input rankings into a single ranking and measures search users' satisfaction with each input ranking on the basis of user clicks on the multileaved ranking. We evaluated the adequacy of two major multileaving methods, team draft multileaving (TDM) and optimized multileaving (OM), and propose practical implementations of both for live services. Our experimental results demonstrated that both multileaving methods could precisely evaluate the effectiveness of five rankings of differing quality using clicks from real users. Moreover, we concluded that OM is more efficient than TDM: most of the evaluation results with OM converged after the multileaved rankings had been shown approximately 40,000 times, a finding further supported by an in-depth analysis of the two methods' characteristics.
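
As a concrete illustration of the merging step, the following is a minimal Python sketch of team draft multileaving in the form commonly described in the interleaving literature; the function name `team_draft_multileave`, the toy document IDs, and the five-ranker example are illustrative assumptions, not the implementation evaluated in this paper.

```python
import random

def team_draft_multileave(rankings, length):
    """Merge several input rankings into one multileaved ranking.

    rankings: list of rankings, each a list of document IDs, best first.
    length:   desired length of the multileaved ranking.
    Returns the merged ranking and a parallel list of "team" indices
    recording which input ranking contributed each document; a click on
    position i credits input ranking teams[i].
    """
    merged, teams, chosen = [], [], set()
    while len(merged) < length:
        progressed = False
        # Each round the teams draft in a fresh random order,
        # which keeps the multileaved ranking unbiased on average.
        order = list(range(len(rankings)))
        random.shuffle(order)
        for team in order:
            if len(merged) >= length:
                break
            # The team contributes its best not-yet-selected document.
            for doc in rankings[team]:
                if doc not in chosen:
                    merged.append(doc)
                    teams.append(team)
                    chosen.add(doc)
                    progressed = True
                    break
        if not progressed:
            break  # every input ranking is exhausted
    return merged, teams

# Toy usage with five input rankings of differing quality.
rankings = [
    ["d1", "d2", "d3", "d4"],
    ["d2", "d1", "d4", "d3"],
    ["d3", "d4", "d1", "d2"],
    ["d4", "d3", "d2", "d1"],
    ["d1", "d3", "d2", "d4"],
]
merged, teams = team_draft_multileave(rankings, length=4)
# Aggregating click credits per team over many impressions yields
# an estimate of each input ranking's relative quality.
```

OM, in contrast, selects multileaved rankings by solving an optimization problem over candidate rankings rather than by random drafting, which underlies the efficiency difference reported above.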