Complex Network based Supervised Keyword Extractor

Abstract: In this paper, we present a supervised framework for automatic keyword extraction from a single document. We model the text as a complex network and construct the feature set by extracting selected node properties from it. Unsupervised, graph-based keyword extraction methods have exploited several node properties to discriminate keywords from non-keywords. We exploit the complex interplay of these node properties to design a supervised keyword extraction method. The training set is created from the feature set by labeling each candidate keyword according to whether or not it is listed as a gold-standard keyword. Since a document contains far fewer keywords than non-keywords, the resulting training set is naturally imbalanced. We therefore balance the training set before training a binary classifier to predict keywords. The model is trained using two public datasets from the scientific domain and tested using three unseen scientific corpora and one news corpus. A comparative study against several recent keyword and keyphrase extraction methods establishes that the proposed method performs better in most cases. This substantiates our claim that graph-theoretic properties of words are effective discriminators between keywords and non-keywords. We support our argument by showing that the improved performance of the proposed method is statistically significant for all datasets. We also evaluate the effectiveness of the pre-trained model on Hindi and Assamese language documents. We observe that the model performs equally well on this cross-language text even though it was trained only on English language documents. This shows that the proposed method is independent of the domain, collection, and language of the training corpora.
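The feature-construction steps described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the sliding-window co-occurrence graph, the three node properties (degree, weighted degree, and k-core number), the function names, and the use of random oversampling to balance the classes. Candidate filtering and the binary classifier itself are omitted.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Undirected weighted graph: nodes are words, and an edge weight counts
    how often two distinct words fall within `window` positions of each other."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word][other] += 1
                graph[other][word] += 1
    return graph


def coreness(graph):
    """k-core number of every node, computed by repeatedly peeling
    the remaining node of minimum degree."""
    degree = {node: len(nbrs) for node, nbrs in graph.items()}
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nbr in graph[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


def node_features(graph):
    """Illustrative feature vector per word: [degree, weighted degree, coreness]."""
    core = coreness(graph)
    return {
        node: [len(nbrs), sum(nbrs.values()), core[node]]
        for node, nbrs in graph.items()
    }


def make_training_set(features, gold_keywords):
    """Label each candidate 1 if it is a gold-standard keyword, else 0."""
    return [(vec, int(word in gold_keywords)) for word, vec in features.items()]


def oversample_minority(rows, seed=0):
    """Balance the classes by duplicating random minority-class rows.
    This is a simple stand-in for whatever balancing technique is actually used."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


# Toy usage: a six-token "document" with one gold-standard keyword.
tokens = ["a", "b", "c", "a", "c", "d"]
features = node_features(build_cooccurrence_graph(tokens))
balanced = oversample_minority(make_training_set(features, {"c"}))
```

The balanced rows could then be passed to any binary classifier. Keeping each stage as a separate function makes it straightforward to swap in other node properties or a different balancing method.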
