Finding Co-occurring Text Phrases by Combining Sequence and Frequent Set Discovery

A significant amount of data resides in loosely structured text collections. The concept of text mining has recently been introduced in order to utilize these resources in data-mining-driven decision making. In our approach, we consider finding multi-term text phrases that tend to co-occur in the documents of a document collection. We combine and further develop two techniques, finding frequent sequences and finding frequent sets, and discuss their suitability for text mining. The process presented in this paper contains two major phases. In the first phase, maximal frequent sequences are extracted from the documents, i.e., sequences of words that are frequent in the document collection and that are not contained in any longer frequent sequence. A sequence is considered frequent if it appears in at least σ documents, where σ is a given frequency threshold. For instance, we may require the sequences to occur in at least 10 documents. In the second phase, co-occurrences of the maximal frequent sequences are found by discovering frequent sets of the sequences, i.e., sets of sequences that tend to co-occur in several documents. We have implemented the methods and experimented with a news collection. The experiments reveal many characteristics of textual data that affect the further development and application of the methods.
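The two phases can be illustrated with a minimal sketch. It is not the implementation described in the paper: it assumes contiguous word sequences only, uses brute-force counting, and all function names, the `max_len` cap, and the toy documents are hypothetical choices made for illustration.

```python
from itertools import combinations


def frequent_ngrams(docs, sigma, max_len=5):
    """Map each contiguous word sequence found in >= sigma documents
    to the set of document ids that contain it."""
    frequent = {}
    for n in range(1, max_len + 1):
        support = {}
        for doc_id, words in enumerate(docs):
            # Count each sequence at most once per document.
            seen = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
            for gram in seen:
                support.setdefault(gram, set()).add(doc_id)
        level = {g: ids for g, ids in support.items() if len(ids) >= sigma}
        if not level:
            # No frequent n-gram means no frequent (n+1)-gram either.
            break
        frequent.update(level)
    return frequent


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def maximal(frequent):
    """Phase 1: keep sequences not contained in a longer frequent one."""
    return {s: ids for s, ids in frequent.items()
            if not any(len(t) > len(s) and contains(t, s) for t in frequent)}


def cooccurring_sets(max_seqs, sigma, size=2):
    """Phase 2: sets of maximal sequences sharing >= sigma documents."""
    result = {}
    for combo in combinations(sorted(max_seqs), size):
        shared = set.intersection(*(max_seqs[s] for s in combo))
        if len(shared) >= sigma:
            result[combo] = shared
    return result


# Toy collection (invented for this example), frequency threshold sigma = 3.
docs = [
    "the stock market fell sharply as interest rates rose".split(),
    "analysts said the stock market reacted when interest rates rose".split(),
    "interest rates rose and the stock market closed lower".split(),
]
max_seqs = maximal(frequent_ngrams(docs, sigma=3))
pairs = cooccurring_sets(max_seqs, sigma=3)
for combo, doc_ids in pairs.items():
    print([" ".join(seq) for seq in combo], sorted(doc_ids))
```

On this toy input the maximal frequent sequences are "the stock market" and "interest rates rose", and the second phase reports them as one co-occurring pair. Enumerating every candidate set, as done here, only scales to small inputs; a level-wise search that prunes infrequent candidates would be the natural replacement on a real collection.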
