PREFCA: A portal retrieval engine based on formal concept analysis

Abstract The web is a network of linked sites in which each site forms either a physical portal or a standalone page. In the former case, the portal is an access point to its embedded web pages, which coherently present a specific topic. In the latter case, millions of standalone web pages scattered throughout the web share the same topic and could be conceptually linked together to form virtual portals. Search engines have been developed to help users reach adequate pages efficiently and effectively. All known current search engine techniques rely on the web page as the basic atomic search unit. They ignore the conceptual links among the retrieved pages, which reveal implicit web-related meanings. However, a semantic model built for a whole portal may contain more semantic information than a model of scattered individual pages. In addition, user queries can be poor and contain imprecise terms that do not reflect the user's real intention. Consequently, retrieving only the standalone pages that are directly related to the query may not satisfy the user's need. In this paper, we propose PREFCA, a Portal Retrieval Engine based on Formal Concept Analysis, which relies on the portal as the main search unit. PREFCA consists of three phases. First, the information extraction phase extracts the portal's semantic data. Second, the formal concept analysis phase uses formal concept analysis to discover the conceptual links among portals and attributes. Finally, in the information retrieval phase, we propose a portal ranking method that retrieves ranked pairs of portals and embedded pages. Additionally, we apply network analysis rules to output some portal characteristics. We evaluated PREFCA on two data sets, namely the Forum for Information Retrieval Evaluation 2010 and ClueWeb09 (category B) test data, for physical and virtual portals respectively. PREFCA achieves higher F-measure accuracy and better Mean Average Precision ranking than other search engine approaches, namely Term Frequency Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), and BM25, with comparable network analysis and efficiency results. It also attains high Mean Average Precision in comparison with learning-to-rank techniques. Moreover, PREFCA achieves a better reach time than Carrot, a well-known topic-based search engine.
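To make the formal concept analysis phase more concrete, the following is a minimal Python sketch of how formal concepts (pairs of a portal set and the attribute set they all share) can be derived from a portal-attribute context. The portal names, attribute terms, and function names are hypothetical and chosen only for illustration; the naive enumeration over attribute subsets is exponential and is not claimed to be the lattice construction algorithm PREFCA uses.

```python
from itertools import combinations

# Hypothetical toy context: each portal maps to the attributes (terms) it has.
CONTEXT = {
    "portal_a": {"retrieval", "ranking"},
    "portal_b": {"retrieval", "lattice"},
    "portal_c": {"retrieval", "ranking", "lattice"},
}


def extent(context, attrs):
    """Return the portals that have every attribute in attrs."""
    return frozenset(p for p, own in context.items() if attrs <= own)


def intent(context, portals):
    """Return the attributes shared by every portal in portals."""
    shared = set().union(*context.values())
    for p in portals:
        shared &= context[p]
    return frozenset(shared)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing each attribute subset."""
    attributes = sorted(set().union(*context.values()))
    concepts = set()
    for size in range(len(attributes) + 1):
        for subset in combinations(attributes, size):
            portals = extent(context, set(subset))
            # Closing the subset yields the full intent of the concept.
            concepts.add((portals, intent(context, portals)))
    return concepts


if __name__ == "__main__":
    for portals, attrs in sorted(formal_concepts(CONTEXT), key=lambda c: -len(c[0])):
        print(sorted(portals), sorted(attrs))
```

In this toy context the most general concept groups all three portals under the shared attribute `retrieval`, while more specific concepts group fewer portals under larger attribute sets; these groupings are the kind of conceptual links among portals and attributes that the abstract refers to.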
