Paper information - Discovering bucket orders from full rankings

Discovering bucket orders from full rankings

Discovering a bucket order B from a collection of possibly noisy full rankings is a fundamental problem that relates to various applications involving rankings. Informally, a bucket order is a total order that allows "ties" between items in a bucket. A bucket order B can be viewed as a "representative" that summarizes a given set of full rankings {T1, T2, ..., Tm}, or conversely B can be an "approximation" of some "ground truth" G where the rankings {T1, T2, ..., Tm} are simply the "linear extensions" of G. Existing work on finding bucket orders, such as the dynamic programming algorithm, is mainly developed from the "representative" perspective, which maximizes items' intra-bucket similarity when forming a bucket. This idea is realized by minimizing the sum of the deviations of median ranks within a bucket. In contrast, from the "approximation" perspective, since each observed full ranking Ti is simply a linear extension of the given "ground truth" bucket order G, items in a big bucket b in G are forced to have different median ranks, and as a result b will have a big sum of deviations. Thus, minimizing the sum of deviations may lead to the undesirable outcome that big buckets are mostly decomposed into small ones. In this paper, we propose a novel heuristic called Abnormal Rank Gap to capture the inter-bucket dissimilarity for better bucket forming. In addition, we propose to use the "closeness" on multiple quantile ranks to determine if two items should be put into the same bucket. We develop a novel bucket order discovery method termed the Bucket Gap algorithm. Our extensive experiments demonstrate that the Bucket Gap algorithm significantly outperforms the major related work, i.e., the Bucket Pivot algorithm. In particular, the error distance of the generated bucket order can be reduced by about 30% on a real paleontological dataset, and the noise tolerance can be increased from 30% to 50% on the synthetic dataset.
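
The argument about median ranks can be made concrete with a small sketch. It is an illustration only, not the paper's algorithm: the toy rankings, the function names `median_ranks` and `bucket_cost`, and the exact cost definition (sum of absolute deviations of the items' median ranks from the bucket's own median) are assumptions chosen for this example, since the abstract does not give the formula.

```python
from statistics import median


def median_ranks(rankings):
    """Map each item to the median of its 1-based positions across the rankings."""
    positions = {}
    for ranking in rankings:
        for pos, item in enumerate(ranking, start=1):
            positions.setdefault(item, []).append(pos)
    return {item: median(ps) for item, ps in positions.items()}


def bucket_cost(bucket, med):
    """Assumed cost: sum of absolute deviations of the items' median ranks
    from the median of those median ranks within the bucket."""
    center = median(med[item] for item in bucket)
    return sum(abs(med[item] - center) for item in bucket)


# Hypothetical ground truth G = {a, b, c} < {d}, observed through three
# of its linear extensions (toy data, not taken from the paper).
rankings = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["b", "a", "c", "d"],
]

med = median_ranks(rankings)

# Cost of keeping the true bucket {a, b, c} together versus splitting it.
cost_true_bucket = bucket_cost(["a", "b", "c"], med)
cost_singletons = sum(bucket_cost([x], med) for x in ["a", "b", "c"])

print(med, cost_true_bucket, cost_singletons)
```

In this toy sample the three tied items receive three different median ranks, so the true bucket has a positive deviation cost while the singleton split costs nothing. A method that only minimizes this cost would therefore prefer to break the bucket apart, which is the behavior the abstract describes as undesirable.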

Wilfred Ng | Qiong Fang | Jianlin Feng
