A heuristic approach for λ-representative information retrieval from large-scale data

Abstract

Retrieving representative information from large-scale data has become an important research issue, especially in mobile business and mobile search, where screen size and navigability are limited. This paper focuses on certain aspects of representativeness in database queries and web search, and proposes an approach for extracting a subset of the original search results that achieves high coverage with low redundancy. The paper introduces the notion of λ-represent, which makes it possible to describe the λ-represent relationship between sets of data objects. The λ-representative problem is then formulated as an extension of the classical set covering problem, which leads to a heuristic approach, named LamRep, that addresses the problem effectively and efficiently. Notably, LamRep incorporates a "vote" mechanism and is enhanced with an algorithmic acceleration strategy. Experiments on benchmark data and a real-world example show that LamRep outperforms the other approaches considered.
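To make the set-covering view concrete, the following is a minimal sketch of a standard greedy covering heuristic. It is not the LamRep algorithm, whose "vote" mechanism and acceleration strategy are not specified in the abstract. It assumes, purely for illustration, that an object a λ-represents an object b when a similarity score sim(a, b) is at least λ; the function name `greedy_representatives`, the parameters `lam` and `k`, and the toy similarity function are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def greedy_representatives(
    items: Sequence[T],
    sim: Callable[[T, T], float],
    lam: float,
    k: Optional[int] = None,
) -> List[int]:
    """Greedy set-cover style selection of representative indices.

    Assumption (not from the paper): item i represents item j
    whenever sim(items[i], items[j]) >= lam.
    """
    n = len(items)
    # covers[i] holds the indices that item i represents under the assumed rule.
    covers = [
        {j for j in range(n) if sim(items[i], items[j]) >= lam}
        for i in range(n)
    ]
    uncovered = set(range(n))
    chosen: List[int] = []
    while uncovered and (k is None or len(chosen) < k):
        # Pick the item that represents the most still-uncovered items.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            break  # no remaining item adds coverage
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy data: three loose groups of numbers on a line.
items = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
toy_sim = lambda a, b: 1.0 / (1.0 + abs(a - b))  # hypothetical similarity
print(greedy_representatives(items, toy_sim, lam=0.8))
```

In this sketch, raising `lam` makes each object represent fewer neighbours, so more representatives are needed for full coverage, while the optional `k` caps the size of the extracted subset at the cost of leaving some objects unrepresented.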