Clustering by means of unsupervised decision trees or hierarchical and K-means-like algorithm

A classical information retrieval system returns a list of documents in response to a user query. The answer list is often so long that users cannot explore all the retrieved documents. Classifying the retrieved documents makes it possible to organize them thematically and to improve precision. In this paper, we present and compare two text classification algorithms. The first is a K-Means-like clustering algorithm initialized with a partial hierarchical classification. The second is a new algorithm that relies on unsupervised decision trees (UDTs). The indexing methods we use (TF-IDF weighting scheme, cosine similarity in the vector space model) do not fully account for all the subjects dealt with in the texts. A better way to take all the themes into account is to cluster sentences from documents instead of documents as a whole. This is achieved by the second method we propose. The effectiveness of these methods is evaluated on the Amaryllis'99 corpora and queries. Since these methods are applied during a post-processing phase, they can be used with any IR system that returns a list of documents. The methods presented here yield significant improvements over a search without classification. To verify that the improvement is due to the methods and not to the distribution of items into classes, the results are compared with those of a random classification.
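To make the indexing and clustering steps named above concrete, the following is a minimal sketch, not the authors' implementation. It builds TF-IDF vectors, compares them with cosine similarity, and runs a generic K-Means-like reassignment loop. The function names, the toy documents, and the `seeds` parameter are assumptions for illustration; in particular, the seed indices merely stand in for the partial hierarchical classification that initializes the first method, whose details are not given here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans_like(vectors, seeds, iterations=10):
    """Assign each vector to its most similar centroid until assignments stabilize.

    `seeds` (hypothetical) lists the indices of the vectors used as initial
    centroids; it is a stand-in for an initial partial classification.
    """
    centroids = [dict(vectors[i]) for i in seeds]
    labels = []
    for _ in range(iterations):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the previous centroid if a cluster empties
                centroids[k] = centroid(members)
    return labels


# Toy "retrieved documents" (invented for illustration only).
docs = [
    "decision tree split node".split(),
    "tree node split leaf".split(),
    "query retrieval ranking".split(),
    "retrieval query relevance ranking".split(),
]
vectors = tfidf_vectors(docs)
labels = kmeans_like(vectors, seeds=[0, 2])
print(labels)
```

Because the loop only needs a list of vectors, the same sketch could be applied to sentences rather than whole documents by passing tokenized sentences as `docs`.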
