A model for mining relevant and non-redundant information

We propose a relatively simple yet powerful model for choosing relevant and non-redundant pieces of information. The model addresses data mining or information retrieval settings where relevance is measured with respect to a set of key or query objects, either specified by the user or obtained by a data mining step. The problem is not only to identify other relevant objects, but also to ensure that they are not related to any negative query objects and that they are not redundant with respect to each other. The model assumes only a similarity or distance function on the objects, and its simple parameterization allows for different behaviors with respect to query objects. We analyze the model and give two efficient, approximate methods. We illustrate and evaluate the proposed model in two applications: linguistics and social networks. The results indicate that the model and methods are useful in finding a relevant and non-redundant set of results. While this area has been a popular topic of research, our contribution is a simple, generic model that covers several related approaches and systematically accounts for positive and negative query objects as well as non-redundancy of the output.
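To make the setting concrete, the following is a minimal sketch of one way to select objects that are similar to positive query objects, dissimilar to negative ones, and not redundant with each other, given only a similarity function. It is not the model or either of the two methods proposed here, whose exact form the abstract does not state; it is a generic greedy heuristic in the spirit of diversity-based reranking. The function name `greedy_select`, the weights `neg_weight` and `red_weight`, the max-based scoring, and the toy similarity `sim` are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def greedy_select(
    candidates: Sequence[float],
    positives: Sequence[float],
    negatives: Sequence[float],
    sim: Callable[[float, float], float],
    k: int,
    neg_weight: float = 1.0,  # hypothetical penalty for closeness to negatives
    red_weight: float = 1.0,  # hypothetical penalty for redundancy
) -> List[float]:
    """Greedily pick up to k candidates (illustrative heuristic only)."""
    selected: List[float] = []
    remaining = list(candidates)

    def score(c: float) -> float:
        # Relevance: similarity to the closest positive query object.
        rel = max((sim(c, q) for q in positives), default=0.0)
        # Penalty: similarity to the closest negative query object.
        neg = max((sim(c, q) for q in negatives), default=0.0)
        # Redundancy: similarity to the closest already selected object.
        red = max((sim(c, s) for s in selected), default=0.0)
        return rel - neg_weight * neg - red_weight * red

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def sim(a: float, b: float) -> float:
    # Toy similarity on points of the real line (an assumption for the demo).
    return 1.0 / (1.0 + abs(a - b))


points = [1.0, 1.1, 3.0, 9.0]
diverse = greedy_select(points, positives=[2.0], negatives=[10.0], sim=sim, k=2)
plain = greedy_select(points, positives=[2.0], negatives=[10.0], sim=sim, k=2,
                      red_weight=0.0)
print(diverse, plain)
```

With the redundancy penalty active, the second pick in this toy run moves away from the near-duplicate of the first pick; with `red_weight=0.0`, the near-duplicate is chosen instead. The candidate close to the negative query object is avoided in both cases.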
