Keyword extraction based on sequential pattern mining

Keyword extraction aims to automatically identify the keywords that capture the main topic of a given document. In this paper, a new keyword extraction algorithm based on sequential patterns is proposed. After preprocessing, a document is represented as a set of word sequences, to which a sequential pattern mining algorithm is applied; the important sequential patterns mined in this way reflect the semantic relatedness between words. Both the statistical features and the pattern features of words are used to build the keyword extraction model. The algorithm is language-independent and does not require a semantic dictionary to obtain semantic features. Experimental results on Chinese journal articles show that the proposed algorithm consistently outperforms the baseline method KEA.
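The pipeline described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the sentence splitting, the restriction to two-word ordered patterns, the support threshold, and the function names (`to_sequences`, `mine_pair_patterns`, `word_features`) are all assumptions made for illustration, and the final model that combines the features is left out because the abstract does not specify it.

```python
import re
from collections import Counter
from itertools import combinations


def to_sequences(text):
    """Split text into sentences, then into lowercase word sequences.

    Assumption: whitespace/punctuation tokenisation; Chinese text would
    need a word segmenter instead.
    """
    sentences = re.split(r"[.!?]+", text)
    sequences = [re.findall(r"\w+", s.lower()) for s in sentences]
    return [seq for seq in sequences if seq]


def mine_pair_patterns(sequences, min_support=2):
    """Mine frequent ordered word pairs (a simplified sequential pattern).

    A pair (a, b) is supported by a sequence if a occurs before b in it,
    with any gap. Support is the number of sequences containing the pair.
    """
    support = Counter()
    for seq in sequences:
        # combinations() preserves positional order, so a precedes b.
        pairs = {(a, b) for a, b in combinations(seq, 2) if a != b}
        support.update(pairs)  # each pair counted once per sequence
    return {p: s for p, s in support.items() if s >= min_support}


def word_features(sequences, patterns):
    """Combine statistical features with a pattern feature for each word."""
    words = [w for seq in sequences for w in seq]
    total = len(words)
    tf = Counter(words)
    first_pos = {}
    for i, w in enumerate(words):
        first_pos.setdefault(w, i / total)  # relative first occurrence
    # Pattern feature: number of frequent patterns the word takes part in.
    pattern_count = Counter(w for p in patterns for w in p)
    return {
        w: {
            "tf": tf[w] / total,
            "first_pos": first_pos[w],
            "pattern_count": pattern_count[w],
        }
        for w in tf
    }


text = (
    "Pattern mining finds frequent patterns. "
    "Sequential pattern mining uses sequences. "
    "Keyword extraction uses pattern mining."
)
sequences = to_sequences(text)
patterns = mine_pair_patterns(sequences, min_support=2)
features = word_features(sequences, patterns)
```

In this toy input the only pair that occurs in at least two sentences is ("pattern", "mining"), so only those two words receive a non-zero pattern feature. A fuller version would mine longer patterns with an algorithm such as PrefixSpan and feed the feature vectors to a trained classifier.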
