Interesting-phrase mining for ad-hoc text analytics

Large text corpora of news, customer mail and reports, or Web 2.0 contributions offer great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has focused on mining keywords or tags in blogs or social-tagging communities, we focus on the analysis of interesting phrases. These include named entities, important quotations, market slogans, and other multi-word phrases that are prominent in a dynamically derived ad-hoc subset of the corpus, for example phrases that are frequent in the subset but relatively infrequent in the overall corpus. We develop preprocessing and indexing methods for phrases, paired with new search techniques for finding the top-k most interesting phrases in ad-hoc subsets of the corpus. We evaluate our framework on a large-scale real-world corpus of New York Times news articles.
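To make the notion of interestingness concrete, the following is a minimal, naive sketch of top-k phrase ranking over an ad-hoc subset. It is not the indexing or search technique developed in the paper. It assumes, purely for illustration, that a phrase is a word n-gram of 2 to 3 tokens and that interestingness is the ratio of a phrase's document frequency in the subset to its document frequency in the whole corpus. The function names, the `min_subset_df` threshold, and the whitespace tokenization are all hypothetical choices.

```python
from collections import Counter
import heapq


def ngrams(tokens, max_len):
    """Yield all word n-grams of length 2..max_len as tuples."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def doc_frequencies(docs, max_len=3):
    """Count, for every phrase, the number of documents containing it."""
    df = Counter()
    for doc in docs:
        # set() so that a phrase counts once per document
        df.update(set(ngrams(doc.lower().split(), max_len)))
    return df


def top_k_interesting(corpus, subset_ids, k=3, max_len=3, min_subset_df=2):
    """Rank phrases by (subset document frequency / corpus document frequency).

    Assumed measure for illustration only: a phrase scores high when it is
    frequent in the subset but relatively infrequent in the overall corpus.
    Returns a list of (score, subset_df, phrase) tuples, best first.
    """
    corpus_df = doc_frequencies(corpus, max_len)
    subset_df = doc_frequencies([corpus[i] for i in subset_ids], max_len)
    scored = (
        (subset_df[p] / corpus_df[p], subset_df[p], " ".join(p))
        for p in subset_df
        if subset_df[p] >= min_subset_df  # ignore phrases seen only once
    )
    return heapq.nlargest(k, scored)
```

This sketch recomputes every phrase count from scratch for each subset, which is the kind of cost that precomputed phrase indexes are meant to avoid on a large corpus.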
