Paper information - Query Length, Number of Classes and Routes through Clusters: Experiments with a Clustering Method for Information Retrieval

A classical information retrieval system ranks documents according to the distance between each text and the user query. The answer list is often so long that users cannot examine every retrieved document, while some relevant documents are ranked so poorly that users never reach them. To address this problem, the retrieved documents are clustered automatically. We describe an algorithm, based on hierarchical and clustering methods, that classifies the set of documents retrieved by any IR system. The method is evaluated on the TREC-7 corpora and queries. We show that it improves retrieval results by providing users with at least one high-precision cluster. We also examine the impact of the number of clusters and of the way the clusters are browsed to build a reordered list. On the TREC corpora and queries, choosing the number of clusters according to query length improves results compared with a number fixed in advance.
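The pipeline outlined in the abstract (cluster the retrieved documents, pick the number of clusters from the query length, then browse the clusters to build a reordered list) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the bag-of-words vectors, the cosine measure, the k-means-style pass standing in for the authors' hierarchical and clustering methods, the `choose_k` thresholds, and the rule that clusters are visited in decreasing order of centroid similarity to the query.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def choose_k(query, n_docs):
    """Hypothetical rule: short queries get 2 clusters, longer ones 3."""
    k = 2 if len(query.split()) <= 3 else 3
    return max(1, min(k, n_docs))


def cluster(vectors, k, max_steps=10):
    """k-means-style pass; seeds are the k top-ranked documents (assumption)."""
    centers = [Counter(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for step in range(max_steps):
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centers[c])) for v in vectors
        ]
        if step > 0 and new_assignment == assignment:
            break  # assignments are stable
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                # A summed vector is enough: cosine ignores the scale.
                total = Counter()
                for v in members:
                    total.update(v)
                centers[c] = total
    return assignment, centers


def reorder(query, ranked_docs):
    """Visit clusters by centroid similarity to the query, keeping the
    original rank order inside each cluster."""
    if not ranked_docs:
        return []
    vectors = [vectorize(d) for d in ranked_docs]
    q = vectorize(query)
    k = choose_k(query, len(ranked_docs))
    assignment, centers = cluster(vectors, k)
    order = sorted(range(k), key=lambda c: cosine(q, centers[c]), reverse=True)
    return [d for c in order for d, a in zip(ranked_docs, assignment) if a == c]
```

Calling `reorder("jaguar car", docs)` on a ranked list that mixes documents about the animal and the vehicle groups the vehicle documents first, because their cluster centroid is closer to the query. Other routes through the clusters, such as interleaving the top documents of each cluster, would only change the final list comprehension.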

Marc El-Bèze | Patrice Bellot
