Topic relevance and diversity in information retrieval from large datasets: A multi-objective evolutionary algorithm approach

Abstract: Enabling effective information search is an increasingly pressing problem, as technology makes it possible to publish information rapidly and large quantities of information become instantly available for retrieval. In this scenario, topical search is the process of searching for material that is relevant to a given topic. Multi-objective Evolutionary Algorithms have demonstrated great potential for addressing the topical search problem in very large datasets. In an evolutionary approach to topical search, a population of queries is automatically generated from a given topic, and this population then evolves towards successively better candidate queries. Despite the promise of this approach, previous studies have revealed a common genotypic phenomenon: as evolution proceeds, the population tends to converge to almost identical sets of terms. This convergence reduces the solution set to a few queries and confines exploration to a very limited region of the search space, which is a limitation when users require different options from a topical search tool. This paper proposes and evaluates strategies to favor diversity in evolutionary topical search. These strategies rely on novel fitness functions, different parameterizations of the crossover and mutation rates, and the use of multiple populations to favor diversity preservation. Experiments conducted using these strategies in combination with the NSGA-II algorithm on a dataset of more than 350,000 labeled web pages indicate that the proposed strategies show great promise for searching very large datasets, helping to achieve query and search result diversity without giving up precision.
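The genotypic convergence described in the abstract can be made concrete with a small sketch. It assumes each query is represented as a set of terms and measures population diversity as the mean pairwise Jaccard distance between queries. This measure, the function names, and the example terms are illustrative assumptions; they are not the fitness functions proposed in the paper.

```python
from itertools import combinations


def jaccard_distance(q1, q2):
    """Return 1 - |q1 & q2| / |q1 | q2| for two queries given as term sets."""
    union = q1 | q2
    if not union:
        # Two empty queries are treated as identical (assumption).
        return 0.0
    return 1.0 - len(q1 & q2) / len(union)


def population_diversity(population):
    """Mean pairwise Jaccard distance over a population of queries.

    A value of 0.0 means every query uses exactly the same terms;
    a value of 1.0 means no two queries share a term.
    """
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical populations for a topic about gardening.
converged = [frozenset({"garden", "soil", "plants"})] * 4
varied = [
    frozenset({"garden", "soil"}),
    frozenset({"compost", "mulch"}),
    frozenset({"pruning", "roses"}),
]

converged_score = population_diversity(converged)  # all queries identical
varied_score = population_diversity(varied)        # no shared terms
```

A score like this could serve as one diagnostic for how far a population has collapsed onto a single set of terms; whether it is suitable as an optimization objective alongside precision is a separate design question that this sketch does not address.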
