Treating Keywords as Outliers: A Keyphrase Extraction Approach

We propose a novel unsupervised keyphrase extraction approach that filters candidate keywords using outlier detection. It starts by training word embeddings on the target document to capture semantic regularities among the words. It then uses the minimum covariance determinant (MCD) estimator to model the distribution of non-keyphrase word vectors, under the assumption that these vectors come from the same distribution, which indicates their irrelevance to the semantics expressed by the dimensions of the learned vector representation. Candidate keyphrases consist only of words that are detected as outliers of this dominant distribution. Empirical results show that our approach outperforms both state-of-the-art and other recent unsupervised keyphrase extraction methods.
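The outlier-filtering step can be illustrated with a minimal sketch. It is not the authors' implementation: it approximates the minimum covariance determinant estimate with a simplified concentration-step (C-step) iteration in NumPy rather than the full FastMCD algorithm, it uses synthetic vectors in place of embeddings trained on a document, and the function names, the `support_fraction` value, and the fixed `n_keep` cut-off are all assumptions made for illustration.

```python
import numpy as np


def squared_mahalanobis(X, loc, cov):
    """Squared Mahalanobis distance of every row of X to (loc, cov)."""
    diff = X - loc
    return np.einsum("ij,ij->i", diff @ np.linalg.pinv(cov), diff)


def mcd_estimate(X, support_fraction=0.75, n_starts=10, max_steps=50, seed=0):
    """Approximate the MCD location and covariance with repeated C-steps.

    Simplified illustration: random h-subsets are refined by keeping the h
    points closest to the current estimate, and the refined subset whose
    covariance has the smallest determinant is returned.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    h = int(np.ceil(support_fraction * n))  # size of the assumed "clean" subset
    best_logdet, best_loc, best_cov = np.inf, None, None
    for _ in range(n_starts):
        subset = np.sort(rng.choice(n, size=h, replace=False))
        for _ in range(max_steps):
            loc = X[subset].mean(axis=0)
            cov = np.cov(X[subset], rowvar=False)
            d2 = squared_mahalanobis(X, loc, cov)
            new_subset = np.sort(np.argsort(d2)[:h])  # C-step
            if np.array_equal(new_subset, subset):
                break  # converged: the subset no longer changes
            subset = new_subset
        loc = X[subset].mean(axis=0)
        cov = np.cov(X[subset], rowvar=False)
        sign, logdet = np.linalg.slogdet(cov)
        if sign > 0 and logdet < best_logdet:
            best_logdet, best_loc, best_cov = logdet, loc, cov
    return best_loc, best_cov


def outlier_words(words, vectors, n_keep):
    """Return the n_keep words whose vectors lie farthest from the bulk."""
    loc, cov = mcd_estimate(vectors)
    d2 = squared_mahalanobis(vectors, loc, cov)
    farthest = np.argsort(d2)[::-1][:n_keep]
    return [words[i] for i in farthest]


# Synthetic stand-in for document word vectors (hypothetical data):
# 200 "ordinary" words from one distribution plus 10 shifted "keyword" vectors.
rng = np.random.default_rng(42)
common = rng.normal(0.0, 1.0, size=(200, 5))
unusual = rng.normal(8.0, 1.0, size=(10, 5))
vectors = np.vstack([common, unusual])
words = [f"w{i}" for i in range(200)] + [f"key{i}" for i in range(10)]

candidates = outlier_words(words, vectors, n_keep=10)
print(sorted(candidates))
```

Ranking by robust distance and keeping a fixed number of words is only one possible cut-off; a distance threshold or an assumed contamination rate would serve the same purpose. In practice the vectors would come from embeddings trained on the target document, and candidate keyphrases would then be built from the flagged words.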
