Paper Information - Validation of Text Clustering Based on Document Contents

Validation of Text Clustering Based on Document Contents

This paper presents some results of a new text clustering methodology. A prototype is a document of interest, or an extracted passage of interesting text. The given prototype is matched against an existing document database or a monitored document flow. Our claim is that the new methodology is capable of automatic content-based clustering using the information contained in the documents. To verify this hypothesis, an experiment was designed using the Bible. Four different translations, one Greek, one Latin, and two Finnish (from 1933/38 and 1992), were selected as test material. Validation experiments were performed with a prototype version of the software application.
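The idea of matching a prototype against a document collection can be illustrated with a minimal sketch. The abstract does not specify the matching procedure, so the code below assumes a plain term-count representation with cosine similarity, which is a stand-in technique and not necessarily the method used in the paper. The function names (`term_counts`, `cosine_similarity`, `rank_by_prototype`) and the toy documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_prototype(prototype, documents):
    """Return (name, score) pairs sorted from closest to farthest match."""
    proto_vec = term_counts(prototype)
    scores = [
        (name, cosine_similarity(proto_vec, term_counts(text)))
        for name, text in documents.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection, not the Bible texts used in the paper.
documents = {
    "doc_a": "the shepherd led the flock to the river",
    "doc_b": "the king counted the gold in the treasury",
    "doc_c": "a shepherd watched his flock by night",
}

# Assumed prototype: a short passage of interest.
ranking = rank_by_prototype("shepherd and flock", documents)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Documents that share more vocabulary with the prototype, relative to their length, rank higher; a document with no shared terms scores zero. Grouping documents by their nearest prototype would then give a simple content-based clustering under these assumptions.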

Hannu Vanharanta | Barbro Back | Ari Visa | Jarmo Toivonen | Tomi Vesanen
