Mining Quality Phrases from Massive Text Corpora

Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) substantially reduces semantic ambiguity and makes such data easier and more efficient to manipulate with database technology. Mining quality phrases is therefore a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora and integrates this extraction with phrasal segmentation. The framework requires only limited training, yet the quality of the phrases it generates is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly with corpus size. Our experiments on large text corpora demonstrate the quality and efficiency of the new method.
