Parallelizing ListNet training using Spark

As ever-larger training sets for learning to rank are created, the scalability of learning has become increasingly important to sustaining improvements in ranking accuracy. Exploiting the independence of computations in "summation form," we show how each iteration of ListNet gradient descent can benefit from parallel execution. We seek to draw the attention of the IR community to Spark, a recently introduced distributed cluster computing system, as a means of reducing the training time of iterative learning-to-rank algorithms. Unlike MapReduce, Spark is especially well suited to iterative and interactive algorithms. Our results show a near-linear reduction in ListNet training time using Spark on Amazon EC2 clusters.
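To make the parallelization concrete, the following is a minimal sketch, not the paper's implementation, of one way to run ListNet gradient descent on Spark's Scala RDD API. It assumes the top-1 ListNet loss, a hypothetical Query case class, Breeze for vector arithmetic, and placeholder values for the input path and feature dimensionality. The key idea matches the abstract: the gradient is a sum of independent per-query terms, so each iteration computes per-query gradients with map and combines them with reduce, while the cached training data stays in memory across iterations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import breeze.linalg.DenseVector

// Hypothetical representation: one query's documents as feature vectors
// plus their relevance labels.
case class Query(features: Array[DenseVector[Double]], labels: Array[Double])

object ListNetSpark {
  // Gradient of the top-1 ListNet cross-entropy loss for a single query:
  //   grad = sum_j (P_score(j) - P_label(j)) * x_j
  // where P_score and P_label are softmax distributions over documents.
  // (Numerical-stability tricks, e.g. subtracting the max score before
  // exponentiating, are omitted for brevity.)
  def queryGradient(w: DenseVector[Double], q: Query): DenseVector[Double] = {
    val expS = q.features.map(x => math.exp(w dot x))
    val zS   = expS.sum
    val expL = q.labels.map(math.exp)
    val zL   = expL.sum
    q.features.indices
      .map(j => q.features(j) * (expS(j) / zS - expL(j) / zL))
      .reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ListNetSpark"))

    // Cache the training queries in cluster memory once; unlike MapReduce,
    // subsequent iterations reuse them without rereading from disk.
    val queries = sc.objectFile[Query]("hdfs:///path/to/queries").cache() // hypothetical path

    val numFeatures = 136 // hypothetical feature dimensionality
    var w   = DenseVector.zeros[Double](numFeatures)
    val eta = 0.01 // learning rate

    for (iter <- 1 to 100) {
      val wB = sc.broadcast(w) // ship current weights to all workers
      // "Summation form": per-query gradients are independent (map),
      // and the full gradient is their sum (reduce).
      val grad = queries.map(q => queryGradient(wB.value, q)).reduce(_ + _)
      w -= grad * eta
    }
  }
}
```

Because each map task touches only its own partition of queries and the reduce is a commutative, associative sum, adding machines splits the per-iteration gradient computation with little coordination overhead, which is consistent with the near-linear speedups reported above.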