Automated Phrase Mining from Massive Text Corpora

As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications, including information extraction and retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers and are thus likely to perform unsatisfactorily on text corpora of new domains and genres without additional, expensive adaptation. None of the state-of-the-art models, even the data-driven ones, is fully automated, because they require human experts to design rules or label phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. In addition, AutoPhrase can be extended to model single-word quality phrases.
