Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

Clustering short texts is a difficult task in itself, and the narrow-domain characteristic poses an additional challenge for current clustering methods. We address this problem with a new measure of distance between documents that is based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate a distance between two probability distributions, we have adapted it to obtain a distance value between two documents. We carried out experiments on two different narrow-domain corpora, and our findings indicate that this measure can be used for the addressed problem, yielding results comparable to those obtained with the Jaccard similarity measure.
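
The abstract does not spell out how the distance is computed, so the following is only a minimal sketch of one way a symmetric Kullback-Leibler distance between two documents could be evaluated. It assumes the common symmetric form, the sum over terms of (P(t) - Q(t)) * log(P(t) / Q(t)), with each document represented by relative term frequencies over the union of both vocabularies. The epsilon smoothing for terms missing from a document, and the names term_distribution, symmetric_kl, and document_distance, are illustrative assumptions and not necessarily the scheme the authors used.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, epsilon=1e-6):
    """Relative term frequencies over a shared vocabulary.

    Terms absent from the document get a small epsilon mass (an assumed
    smoothing choice) so that the logarithm below is always defined.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    raw = {t: (counts[t] / total if counts[t] else epsilon) for t in vocabulary}
    norm = sum(raw.values())
    # Renormalise so the smoothed values sum to 1.
    return {t: p / norm for t, p in raw.items()}


def symmetric_kl(p, q):
    """Symmetric KL distance: sum of (p - q) * log(p / q) over all terms."""
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in p)


def document_distance(tokens_a, tokens_b, epsilon=1e-6):
    """Distance between two tokenised documents (hypothetical helper)."""
    vocabulary = set(tokens_a) | set(tokens_b)
    p = term_distribution(tokens_a, vocabulary, epsilon)
    q = term_distribution(tokens_b, vocabulary, epsilon)
    return symmetric_kl(p, q)


if __name__ == "__main__":
    doc_a = "gene expression cancer gene".split()
    doc_b = "gene therapy cancer trial".split()
    print(document_distance(doc_a, doc_b))
```

Under these assumptions the value is zero for identical documents, grows as the term distributions diverge, and is symmetric in its arguments, so it can be supplied to a clustering algorithm wherever a pairwise distance is expected.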

Paolo Rosso | David Pinto | José-Miguel Benedí
