Content coverage maximization on word networks for hierarchical topic summarization

This paper studies text summarization by extracting hierarchical topics from a given collection of documents. We propose a new approach of text modeling via network analysis. We convert documents into a word influence network, and find the words summarizing the major topics with an efficient influence maximization algorithm. Besides, the influence capability of the topic words on other words in the network reveal the relations among the topic words. Then we cluster the words and build hierarchies for the topics. Experiments on large collections of Web documents show that a simple method based on the influence analysis is effective, compared with existing generative topic modeling and random walk based ranking.

[1]  Haixun Wang,et al.  Automatic taxonomy construction from keywords , 2012, KDD.

[2]  Andreas Krause,et al.  Cost-effective outbreak detection in networks , 2007, KDD '07.

[3]  Ariel Fuxman,et al.  Using the wisdom of the crowds for keyword generation , 2008, WWW.

[4]  Wei Chen,et al.  Efficient influence maximization in social networks , 2009, KDD.

[5]  Wei Li,et al.  Mixtures of hierarchical topics with Pachinko allocation , 2007, ICML '07.

[6]  Jimeng Sun,et al.  Social influence analysis in large-scale networks , 2009, KDD.

[7]  Éva Tardos,et al.  Maximizing the Spread of Influence through a Social Network , 2015, Theory Comput..

[8]  Vijay V. Vazirani,et al.  Approximation Algorithms , 2001, Springer Berlin Heidelberg.

[9]  Shui-Lung Chuang,et al.  A practical web-based approach to generating topic hierarchy for text segments , 2004, CIKM '04.

[10]  Matthew Richardson,et al.  Mining knowledge-sharing sites for viral marketing , 2002, KDD.

[11]  Dragomir R. Radev A Common Theory of Information Fusion from Multiple Text Sources Step One: Cross-Document Structure , 2000, SIGDIAL Workshop.

[12]  Jure Leskovec,et al.  Inferring networks of diffusion and influence , 2010, KDD.

[13]  Matthew Richardson,et al.  Mining the network value of customers , 2001, KDD '01.

[14]  Dragomir R. Radev,et al.  Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies , 2000, ArXiv.

[15]  M. L. Fisher,et al.  An analysis of approximations for maximizing submodular set functions—I , 1978, Math. Program..

[16]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[17]  Xiaojun Wan,et al.  Exploiting neighborhood knowledge for single document summarization and keyphrase extraction , 2010, TOIS.

[18]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[19]  Simone Paolo Ponzetto,et al.  Deriving a Large-Scale Taxonomy from Wikipedia , 2007, AAAI.

[20]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[21]  Jade Goldstein-Stewart,et al.  The use of MMR, diversity-based reranking for reordering documents and producing summaries , 1998, SIGIR '98.

[22]  Wei Li,et al.  Pachinko allocation: DAG-structured mixture models of topic correlations , 2006, ICML.

[23]  Yinan Zhang,et al.  A phrase mining framework for recursive construction of a topical hierarchy , 2013, KDD.

[24]  Dragomir R. Radev,et al.  LexRank: Graph-based Lexical Centrality as Salience in Text Summarization , 2004, J. Artif. Intell. Res..

[25]  Wei Chen,et al.  Scalable influence maximization for independent cascade model in large-scale social networks , 2012, Data Mining and Knowledge Discovery.

[26]  Xiaojun Wan,et al.  Towards an Iterative Reinforcement Approach for Simultaneous Document Summarization and Keyword Extraction , 2007, ACL.

[27]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[28]  Masahiro Kimura,et al.  Tractable Models for Information Diffusion in Social Networks , 2006, PKDD.

[29]  Laks V. S. Lakshmanan,et al.  Learning influence probabilities in social networks , 2010, WSDM '10.

[30]  Manuel de Buenaga Rodríguez,et al.  Using WordNet to Complement Training Information in Text Categorization , 1997, ArXiv.