Text documents are commonly characterised and classified by keywords that their authors assign to them. Visa et al. (1999; 2000) have, however, developed a new methodology based on prototype matching. The prototype is a document of interest, or a passage of interest extracted from one. This prototype is matched against an existing document database or a monitored document flow. Our claim is that the new methodology can extract meaning automatically from the contents of a document. To test this hypothesis, an experiment was designed using the Bible. Two translations, one in English and one in Finnish, were selected as test material. Verification tests, which searched for the ten nearest books to every book of the Bible, were performed with a prototype version of the software application. The test results are reported in this paper. The methodology is based on a hierarchy of self-organizing maps (SOM) and on a smart encoding of words. The words of a text document are first encoded and represented as word vectors. The SOM then clusters these word vectors, and this process creates a word map.
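The word-map step described above (encode words, represent them as vectors, cluster the vectors with a SOM) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the smart encoding, so `encode_word` is a hypothetical stand-in, and the one-dimensional map, its size, and the learning schedule in `train_som` are likewise illustrative choices. Only the first level of the hierarchy (the word map) is shown.

```python
import math
import random


def encode_word(word, dim=8):
    """Hypothetical encoding (an assumption, not the paper's scheme):
    fold position-weighted character codes into a fixed-length vector,
    then scale the vector to unit length."""
    vec = [0.0] * dim
    for i, ch in enumerate(word.lower()):
        vec[i % dim] += ord(ch) * (i + 1)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def best_matching_unit(weights, vec):
    """Return the index of the map node whose weights are closest to vec."""
    return min(
        range(len(weights)),
        key=lambda k: sum((w - v) ** 2 for w, v in zip(weights[k], vec)),
    )


def train_som(vectors, n_nodes=6, epochs=50, seed=0):
    """Train a one-dimensional SOM with a Gaussian neighbourhood.
    Map size and schedules are illustrative assumptions."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays from 1 towards 0
        lr = 0.5 * frac                      # learning rate
        radius = max(1.0, (n_nodes / 2) * frac)  # neighbourhood width
        for vec in vectors:
            bmu = best_matching_unit(weights, vec)
            for k in range(n_nodes):
                influence = math.exp(-((k - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    weights[k][d] += lr * influence * (vec[d] - weights[k][d])
    return weights


words = "in the beginning god created the heaven and the earth".split()
vectors = [encode_word(w) for w in words]
som = train_som(vectors)

# The word map assigns each word to its best-matching node on the trained SOM.
word_map = {w: best_matching_unit(som, v) for w, v in zip(words, vectors)}
```

In a hierarchy of maps, the node indices in `word_map` would serve as input for the next level; how that input is built is not described in the abstract, so it is left out here.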
[1] Teuvo Kohonen et al. Self-Organizing Maps. 2010.
[2] Hinrich Schütze et al. Book Reviews: Foundations of Statistical Natural Language Processing. CL, 1999.
[3] Yiming Yang et al. Learning approaches for detecting and tracking news events. IEEE Intell. Syst., 1999.
[4] Vladimir I. Levenshtein et al. Binary codes capable of correcting deletions, insertions, and reversals. 1965.
[5] Dunja Mladenic et al. Text-learning and related intelligent agents: a survey. IEEE Intell. Syst., 1999.
[6] Hannu Vanharanta et al. Toward text understanding: classification of text documents by word map. SPIE Defense + Commercial Sensing, 2000.
[7] Hannu Vanharanta et al. Knowledge discovery from text documents based on paragraph maps. Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, 2000.