In the present chapter we report on some extensions of the work presented in the first edition of the Encyclopedia of Data Mining. In Caramia and Felici (2005) we described a method based on clustering and a heuristic search method based on a genetic algorithm to extract pages with relevant information for a specific user query in a thematic search engine. Starting from these results, we have extended the research to address some issues related to the semantic aspects of the search, focusing on the keywords that are used to establish the similarity among the pages that result from the query. Complete details on this method, here omitted for brevity, can be found in Caramia and Felici (2006). Search engine technologies remain a strong research topic, as new problems and new demands from the market and the users arise. The process of switching from quantity (maintaining and indexing large databases of web pages and quickly selecting pages that match some criterion) to quality (identifying pages of high quality for the user), already highlighted in Caramia and Felici (2005), has not been interrupted, but has gained further energy, motivated by the natural evolution of internet users, who are more selective in their choice of search tool and willing to pay the price of providing extra feedback to the system and of waiting longer to have their queries better matched. In this framework, several authors have considered the use of data mining and optimization techniques, often referred to as web mining (for a recent bibliography on this topic see, e.g., Getoor, Senator, Domingos, and Faloutsos, 2003 and Zaiane, Srivastava, Spiliopoulou, and Masand, 2002).
The work described in this chapter is based on clustering techniques to identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; then, a number of small and potentially good subsets of pages are constructed by extracting from each cluster the pages with the highest scores. Operating on these subsets with a genetic algorithm, a subset with a good overall score and a high internal dissimilarity is identified. A related problem is then considered: the selection of a subset of pages that are compliant with the search keywords, but that also share a large set of words different from the search keywords. This characteristic represents a sort of semantic connection among these pages that may be of use to spot some particular aspects of the information present in the pages. Such a task is accomplished by the construction of a special graph, whose maximum-weight clique and k-densest subgraph should represent the page subsets with the desired properties. In the following we summarize the main background topics and provide a synthetic description of the methods. Interested readers may find additional information in Caramia and Felici (2004), Caramia and Felici (2005), and Caramia and Felici (2006).
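To make the clique-based idea more concrete, the following is a minimal illustrative sketch, not the construction used in Caramia and Felici (2006). It assumes that nodes are pages, that two pages are joined by an edge when they share at least `min_shared` words other than the query keywords, that the edge weight is the number of such shared words, and that the weight of a clique is the sum of its edge weights. The function names, the threshold, and the toy pages are all hypothetical, and the exhaustive search shown here is only practical for very small page sets.

```python
from itertools import combinations


def shared_word_graph(pages, query, min_shared=2):
    """Return {(a, b): weight} for page pairs sharing enough non-query words.

    Assumption: the weight is the count of shared words outside the query.
    """
    query_words = {w.lower() for w in query}
    words = {
        name: set(text.lower().split()) - query_words
        for name, text in pages.items()
    }
    edges = {}
    for a, b in combinations(sorted(words), 2):
        common = words[a] & words[b]
        if len(common) >= min_shared:
            edges[(a, b)] = len(common)
    return edges


def max_weight_clique(nodes, edges):
    """Exhaustively search for the clique with the largest total edge weight.

    Exponential in the number of nodes; a stand-in for a real solver.
    """
    nodes = sorted(nodes)
    best, best_weight = (), 0
    for size in range(2, len(nodes) + 1):
        for subset in combinations(nodes, size):
            pairs = list(combinations(subset, 2))
            # A subset is a clique only if every pair is connected.
            if all(pair in edges for pair in pairs):
                weight = sum(edges[pair] for pair in pairs)
                if weight > best_weight:
                    best, best_weight = subset, weight
    return best, best_weight


# Hypothetical toy data: four pages returned for the query "jaguar".
pages = {
    "p1": "jaguar speed habitat rainforest predator",
    "p2": "jaguar habitat rainforest conservation predator",
    "p3": "jaguar car engine speed",
    "p4": "jaguar rainforest predator habitat prey",
}
edges = shared_word_graph(pages, query=["jaguar"])
clique, weight = max_weight_clique(pages.keys(), edges)
print(clique, weight)
```

In this toy example the three pages about the animal share the words "habitat", "rainforest", and "predator", so they form the selected clique, while the page about the car is left out. This is the kind of semantic connection beyond the search keywords that the graph is meant to expose.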
[1] Mohand Boughanem et al. Genetic Approach to Query Space Exploration. Information Retrieval, 2004.
[2] Philip Calvert et al. Encyclopedia of Data Warehousing and Mining. 2006.
[3] Oren Etzioni et al. Grouper: A Dynamic Clustering Interface to Web Search Results. Comput. Networks, 1999.
[4] Anton Leuski et al. Evaluating document clustering for interactive information retrieval. CIKM '01, 2001.
[5] Donald H. Kraft et al. Genetic Algorithms for Query Optimization in Information Retrieval: Relevance Feedback. 1997.
[6] Jorng-Tzong Horng et al. Applying genetic algorithms to query optimization in document retrieval. Inf. Process. Manag., 2000.
[7] Hsinchun Chen. Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms. 1995.
[8] Oren Etzioni et al. Web document clustering: a feasibility demonstration. SIGIR '98, 1998.
[9] Giovanni Felici et al. Mining relevant information on the Web: a clique-based approach. 2006.
[10] Giovanni Felici et al. Improving search results with data mining in a thematic search engine. Comput. Oper. Res., 2004.
[11] Anil K. Jain et al. Data clustering: a review. CSUR, 1999.
[12] Herna L. Viktor et al. Visual Data Mining from Visualization to Visual Information Mining. Encyclopedia of Data Warehousing and Mining, 2009.