Design a batched information retrieval system based on a concept-lattice-like structure

Abstract Information retrieval (IR), one of the most popular and challenging research areas given the rapid growth of web data, is a fundamental technology for processing and analyzing large-scale datasets. IR systems usually handle massive, continuous streams of retrieval requests in data matching, information filtering and other application scenarios. However, general IR applications such as search engines mostly emphasize individual response time, and their efficiency in handling massive numbers of queries relies mainly on caching or similar technologies. To improve the overall efficiency of handling massive numbers of queries, we design a batched information retrieval system that first analyzes a batch of queries and then exploits the repeats, similarities and correlations among them to accelerate retrieval. First, a concept-lattice-like structure called the keyword-DAG (Directed Acyclic Graph) is used to store and organize the similarity among queries. Accordingly, a keyword-DAG processing algorithm, called pruning, is devised to implement the batched retrieval. Then an incremental ranking algorithm is presented for batched IR scenarios, which has been demonstrated, both in theory and in practice, to shorten retrieval time remarkably. Finally, an overall planning algorithm is proposed for choosing the optimal pruning path and improving memory utilization. The experimental results show that our approach performs far better than the traditional separate retrieval method in mass data processing and analysis scenarios.
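The abstract does not specify the keyword-DAG or the pruning algorithm, so the following Python sketch only illustrates the general idea of reusing work across a batch. It assumes conjunctive keyword queries over an in-memory inverted index; the names `separate_retrieval` and `batched_retrieval`, and the subset-based reuse rule, are hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet, Iterable, Set

Index = Dict[str, Set[int]]   # keyword -> ids of documents containing it
Query = FrozenSet[str]        # a conjunctive query is a set of keywords


def separate_retrieval(index: Index, query: Query) -> Set[int]:
    """Baseline: intersect the posting sets of every keyword from scratch."""
    postings = [index.get(k, set()) for k in query]
    if not postings:
        return set()  # assumption: an empty query matches nothing
    return set.intersection(*postings)


def batched_retrieval(
    index: Index, queries: Iterable[Iterable[str]]
) -> Dict[Query, Set[int]]:
    """Answer a batch, reusing results of queries that are keyword subsets.

    Assumed simplification of a keyword-DAG: each query is linked to the
    largest already-answered query whose keywords it strictly contains.
    """
    # Repeated queries collapse to one entry; smaller queries go first.
    unique = sorted({frozenset(q) for q in queries}, key=len)
    results: Dict[Query, Set[int]] = {}
    for q in unique:
        # Largest non-empty answered query that is a proper subset of q.
        parent = max((p for p in results if p and p < q), key=len, default=None)
        if parent is None:
            results[q] = separate_retrieval(index, q)
        else:
            docs = set(results[parent])   # start from the parent's answer
            for k in q - parent:          # intersect only the new keywords
                docs &= index.get(k, set())
            results[q] = docs
    return results


index = {"a": {1, 2, 3}, "b": {2, 3}, "c": {3, 4}}
batch = [["a", "b"], ["b", "a"], ["a", "b", "c"], ["c"]]
answers = batched_retrieval(index, batch)
```

In this sketch the repeated query `{a, b}` is answered once, and `{a, b, c}` starts from the stored answer for `{a, b}` and intersects only the posting set of `c`. Ranking, pruning-path planning and memory management, which the abstract also mentions, are left out.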
