We have developed a method that extracts all maximal frequent word sequences from the documents of a collection. A sequence is frequent if it appears in more than σ documents, where σ is a given frequency threshold. A sequence is maximal if no other frequent sequence contains it. The words of a sequence do not have to appear consecutively in the text.

In this paper, we briefly describe the method for finding all maximal frequent word sequences in text and then extend it to extract generalized sequences from annotated texts, in which each word has a set of additional features (e.g., morphological) attached to it. We aim to discover patterns that preserve as many features as possible while their frequency still exceeds the given threshold.
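The definitions of "frequent" and "maximal" above can be made concrete with a small brute-force sketch. This is not the extraction method of the paper, only an illustration of the definitions on toy data. The function names, the `sigma` parameter, and the `max_len` cap on candidate length are assumptions introduced here, and maximality is therefore only relative to sequences of at most `max_len` words.

```python
from itertools import combinations


def is_subsequence(pattern, words):
    """True if pattern occurs in words in order; gaps between words are allowed."""
    it = iter(words)
    # Each membership test consumes the iterator up to the match,
    # so the words of the pattern must appear in the same order.
    return all(w in it for w in pattern)


def support(pattern, docs):
    """Number of documents that contain pattern as a (gapped) subsequence."""
    return sum(is_subsequence(pattern, d) for d in docs)


def maximal_frequent_sequences(docs, sigma, max_len=4):
    """Brute-force illustration (hypothetical helper, exponential in max_len)."""
    # Enumerate every subsequence of every document up to max_len words.
    candidates = set()
    for d in docs:
        for n in range(1, max_len + 1):
            for idx in combinations(range(len(d)), n):
                candidates.add(tuple(d[i] for i in idx))
    # Frequent: appears in MORE than sigma documents, as in the definition above.
    frequent = {p for p in candidates if support(p, docs) > sigma}
    # Maximal: no other frequent sequence contains it.
    return {
        p for p in frequent
        if not any(p != q and is_subsequence(p, q) for q in frequent)
    }


docs = [
    ["a", "b", "c", "d"],
    ["a", "x", "b", "c"],
    ["a", "b", "y", "c"],
]
result = maximal_frequent_sequences(docs, sigma=2)
print(result)
```

With this toy collection and `sigma=2`, a sequence must occur in all three documents to be frequent. The shorter sequences such as `("a", "b")` are frequent but not maximal, because the frequent sequence `("a", "b", "c")` contains them.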