Mining relevant information on the Web: a clique-based approach

Information management and retrieval have become increasingly important in production processes in recent years. In this context, the ability to quickly find a small piece of needed information within the huge amount available is crucial. Search engines are one category of tools devoted to this task. Once the basic needs of the Web user were satisfied, research turned to new tools that help more sophisticated users (communities, companies, interest groups) with more elaborate methods, such as clustering and classification algorithms or other specific data mining techniques. For such users, a properly used thematic search engine is a crucial tool for supporting and orienting many activities. Several practical and theoretical problems arise in developing such tools, and we address some of them in this paper, extending previous work on Web mining. We consider two related problems: how to select an appropriate set of keywords for a thematic engine, taking into account the semantic and linguistic extensions of the search context, and how to select and rank a subset of relevant pages given a set of search keywords. Both problems are solved within the same framework, which represents the available information as a graph and searches for particular node subsets of that graph. These subsets are identified by a maximum-weight clique algorithm customized for each specific problem. The methods were developed within a funded research project on new Web search tools; they have been tested on real data and are currently being implemented in a prototype thematic search engine. The Web mining method presented in this paper can also be applied to Web-based design and manufacturing.
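To make the notion of a maximum-weight clique concrete, the following is a minimal sketch in Python. It is not the customized algorithm of the paper. It is a generic exact branch-and-bound search that is only practical for small graphs, and it assumes strictly positive node weights. The keyword names, weights, and edges are hypothetical illustration data, with an edge standing for two keywords judged compatible with each other.

```python
from typing import Dict, FrozenSet, Set, Tuple


def max_weight_clique(
    weights: Dict[str, float], adj: Dict[str, Set[str]]
) -> Tuple[float, FrozenSet[str]]:
    """Exact search for a clique of maximum total node weight.

    Generic illustration only; assumes all weights are positive.
    """
    best_weight = 0.0
    best_clique: FrozenSet[str] = frozenset()

    def extend(clique: Set[str], clique_weight: float, candidates: Set[str]) -> None:
        nonlocal best_weight, best_clique
        if clique_weight > best_weight:
            best_weight, best_clique = clique_weight, frozenset(clique)
        # Bound: even adding every remaining candidate cannot beat the best.
        if clique_weight + sum(weights[v] for v in candidates) <= best_weight:
            return
        # Try heavier nodes first so that good solutions are found early.
        for v in sorted(candidates, key=lambda u: -weights[u]):
            candidates = candidates - {v}  # later branches never revisit v
            # Only neighbours of v can still extend the clique.
            extend(clique | {v}, clique_weight + weights[v], candidates & adj[v])

    extend(set(), 0.0, set(weights))
    return best_weight, best_clique


# Hypothetical keyword graph: node weight = relevance score (made up).
weights = {"clique": 3.0, "graph": 2.0, "mining": 2.5, "web": 1.0, "recipe": 4.0}
edges = [
    ("clique", "graph"),
    ("clique", "mining"),
    ("graph", "mining"),
    ("mining", "web"),
    ("recipe", "web"),
]

# Build a symmetric adjacency map from the edge list.
adj: Dict[str, Set[str]] = {v: set() for v in weights}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

total, chosen = max_weight_clique(weights, adj)
print(total, sorted(chosen))
```

In this toy graph the heaviest single node ("recipe") is not part of the selected subset, because the three mutually connected keywords together outweigh any clique containing it. That is the property that makes a clique formulation attractive for picking a coherent keyword or page set rather than simply the top-scoring items.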
