Clustering Abstracts of Scientific Texts Using the Transition Point Technique

Free access to scientific papers in major digital libraries and other web repositories is limited to their abstracts. Current keyword-based techniques fail on narrow domain-oriented libraries, e.g., those containing only documents on high energy physics, such as the hep-ex collection of CERN. We propose a simple procedure to cluster abstracts, which consists of applying the transition point technique during the term selection process. This technique indexes documents by their mid-frequency terms, because these terms carry high semantic content. In our experiments, the transition point approach was compared with well-known unsupervised term selection techniques. The results show that the transition point technique can achieve better performance than traditional methods. Moreover, we propose an approach to analyse the stability of the transition point term selection method.
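As a rough illustration of term selection around the transition point, the sketch below uses one commonly cited formulation, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once. The abstract does not state which formula or neighbourhood the authors used, so the formula, the function names, and the `width` parameter (here 40% around TP) are all assumptions made for this example.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Estimate the transition point from a term -> frequency mapping.

    Assumed formulation (not given in the abstract):
    TP = (sqrt(8 * I1 + 1) - 1) / 2, with I1 = number of terms of frequency 1.
    """
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_mid_frequency_terms(tokens, width=0.4):
    """Keep terms whose frequency lies within +/- width * TP of TP.

    `width` is a hypothetical neighbourhood size chosen for illustration.
    """
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {term for term, f in freqs.items() if low <= f <= high}


# Toy token list: three terms occur once, "d" twice, "e" five times.
tokens = "a b c d d e e e e e".split()
selected = select_mid_frequency_terms(tokens)
```

In this toy input, very rare terms and the most frequent term both fall outside the neighbourhood, which is the intuition behind indexing by mid-frequency terms. The selected terms would then feed whatever clustering method is in use.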
