Finding Maximal Sequential Patterns in Text Document Collections and Single Documents

This paper presents two algorithms for discovering all the Maximal Sequential Patterns (MSP) in a document collection and in a single document. The proposed algorithms follow the pattern-growth strategy: small frequent sequences are found first and are then grown to obtain the MSP. Our algorithms process the documents incrementally, which avoids recomputing all the MSP when new documents are added. We also report experiments that measure the performance of our algorithms and compare them against the GSP, DELISP, GenPrefixSpan, and cSPADE algorithms on standard public databases. Summary (translated from the Slovenian "Povzetek"): Two algorithms for finding the longest sequences in text are presented.
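To make the pattern-growth idea concrete, the following minimal sketch illustrates the general strategy; it is not the algorithms proposed in the paper. It rests on several assumptions that the abstract does not state: a sequence is a run of contiguous words, its support is the number of documents containing it, and a pattern is maximal if it is not contained in any longer frequent pattern. The function names and the `min_support` parameter are hypothetical, and the sketch does not attempt the incremental processing described above.

```python
from collections import defaultdict


def frequent_sequences(docs, min_support):
    """Return {word tuple: support} for every frequent contiguous word sequence.

    Assumption: support = number of documents containing the sequence.
    """
    tokenized = [doc.split() for doc in docs]

    def support(occurrences):
        # Count distinct documents, not individual occurrences.
        return len({doc_id for doc_id, _ in occurrences})

    # Start with single words and remember where each one occurs.
    level = defaultdict(list)  # pattern -> [(doc_id, start_index), ...]
    for doc_id, words in enumerate(tokenized):
        for i, word in enumerate(words):
            level[(word,)].append((doc_id, i))
    current = {p: occ for p, occ in level.items() if support(occ) >= min_support}

    frequent = {}
    while current:
        for pattern, occ in current.items():
            frequent[pattern] = support(occ)
        # Grow each frequent pattern by the word that follows each occurrence.
        grown = defaultdict(list)
        for pattern, occ in current.items():
            for doc_id, start in occ:
                end = start + len(pattern)
                if end < len(tokenized[doc_id]):
                    longer = pattern + (tokenized[doc_id][end],)
                    grown[longer].append((doc_id, start))
        current = {p: occ for p, occ in grown.items() if support(occ) >= min_support}
    return frequent


def is_contiguous_subsequence(short, long):
    """True if the tuple `short` appears as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_sequential_patterns(docs, min_support):
    """Keep only frequent sequences not contained in a longer frequent one."""
    frequent = frequent_sequences(docs, min_support)
    return {
        p: s
        for p, s in frequent.items()
        if not any(p != q and is_contiguous_subsequence(p, q) for q in frequent)
    }


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat sat on the sofa",
        "a dog sat on the mat",
    ]
    for pattern, sup in maximal_sequential_patterns(docs, min_support=2).items():
        print(" ".join(pattern), sup)
```

Growing only to the right is sufficient in this sketch because every frequent sequence of length k + 1 has a frequent prefix of length k. The final filtering step compares all pairs of frequent sequences, which is simple but quadratic; it is kept here only for readability.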
