Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

Clustering short texts is a difficult task in itself, and the narrow-domain characteristic poses an additional challenge for current clustering methods. We address this problem with a new measure of distance between documents based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to compare two probability distributions, we adapt it to obtain a distance value between two documents. We carried out experiments on two different narrow-domain corpora, and our findings indicate that this measure can be applied to the addressed problem, obtaining results comparable to those achieved with the Jaccard similarity measure.
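The core idea of comparing two documents via the symmetric Kullback-Leibler distance can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the tokenization, the epsilon back-off smoothing for unseen terms, and the renormalization step are assumptions introduced here so that the distance is well-defined when a term appears in only one document.

```python
import math
from collections import Counter

def term_distribution(tokens, vocab, epsilon=1e-4):
    """Smoothed unigram distribution over a shared vocabulary.

    Epsilon smoothing for unseen terms is an assumption of this sketch,
    not necessarily the scheme used in the paper.
    """
    counts = Counter(tokens)
    total = len(tokens)
    probs = {w: (counts[w] / total if w in counts else epsilon) for w in vocab}
    z = sum(probs.values())           # renormalize so the mass sums to 1
    return {w: p / z for w, p in probs.items()}

def symmetric_kl(p, q):
    # Symmetric KL distance: KL(p||q) + KL(q||p),
    # written compactly as sum_w (p_w - q_w) * log(p_w / q_w).
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)

# Toy "documents" standing in for short narrow-domain abstracts.
doc_a = "the cell tumor growth cell".split()
doc_b = "tumor cell therapy results".split()
vocab = set(doc_a) | set(doc_b)

pa = term_distribution(doc_a, vocab)
pb = term_distribution(doc_b, vocab)
distance = symmetric_kl(pa, pb)
```

The resulting values are non-negative and symmetric in the two documents, so they can feed directly into a standard (e.g. hierarchical) clustering algorithm in place of a Jaccard-based dissimilarity.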
