Keyphrase Extraction and Grouping Based on Association Rules

Keyphrases capture the main content of a document and are therefore useful for many natural language processing tasks, such as information retrieval, document classification, and text summarization. Keyphrase extraction aims to identify the multi-word sequences in a collection of documents that approximately correspond to keyphrases. In this paper, we propose a new method for keyphrase extraction based on association rule mining. Redundant multi-word sequences and synonymous phrases inevitably make up a large part of the extracted keyphrases. Association rules also let us reduce this redundancy by grouping related keyphrases that co-occur frequently. We further apply our keyphrase extraction and grouping solution to information retrieval, where both distinguishing and grouping keyphrases improves retrieval performance.
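
The abstract does not spell out how the grouping is computed, so the following is only a minimal sketch of the general idea, not the method of the paper. It assumes each document has already been reduced to a set of candidate phrases, and it merges two phrases into one group when the association rules in both directions between them meet a support and a confidence threshold. The function name `group_keyphrases` and the parameters `min_support` and `min_confidence` are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def group_keyphrases(docs, min_support=2, min_confidence=0.6):
    """Group phrases linked by strong pairwise association rules.

    docs: list of sets, one set of candidate phrases per document
          (assumed input format, not taken from the paper).
    min_support: minimum number of documents containing both phrases.
    min_confidence: minimum confidence required for A -> B and B -> A.
    """
    single = Counter()  # number of documents containing each phrase
    pair = Counter()    # number of documents containing both phrases
    for phrases in docs:
        single.update(phrases)
        pair.update(frozenset(p) for p in combinations(sorted(phrases), 2))

    # Union-find structure: every phrase starts in its own group.
    parent = {p: p for p in single}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for ab, count in pair.items():
        if count < min_support:
            continue
        a, b = sorted(ab)
        conf_ab = count / single[a]  # confidence of rule a -> b
        conf_ba = count / single[b]  # confidence of rule b -> a
        if conf_ab >= min_confidence and conf_ba >= min_confidence:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for p in single:
        groups.setdefault(find(p), set()).add(p)
    # Sort for a deterministic result.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

Requiring the rule to hold in both directions is a design choice of this sketch: it keeps a frequent general phrase from absorbing every rarer phrase that happens to co-occur with it. A one-directional rule would give looser, larger groups.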
