OnSeS: A Novel Online Short Text Summarization Based on BM25 and Neural Network

The last decade has witnessed dramatic growth in social networks such as Twitter and Sina Microblog. Messages on these platforms are short, which makes them difficult for machines to understand. Moreover, their sheer volume makes it rarely possible for users to read and understand all the content. It is therefore imperative to cluster these short texts and extract their viewpoints. A common solution enriches the representation of a word with additional features from external sources, but this is demanding in terms of computation and time. In this paper, we propose OnSeS, a novel short text summarization method that uses word2vec to represent each word and a neural network model to generate each word of the summary. OnSeS consists of three phases: 1) clustering short texts using the K-means algorithm; 2) ranking the content of each cluster by building a graph-based ranking model using BM25; 3) generating the main point of each cluster by applying a neural machine translation model to the top-ranked sentence. The experimental results reveal that our fully data-driven approach outperforms state-of-the-art methods.
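The second phase, graph-based ranking with BM25, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the details here are assumptions: whitespace tokenization, a non-negative IDF variant, the common defaults `k1=1.5` and `b=0.75`, and a TextRank-style weighted PageRank iteration. The function names `bm25_similarity` and `rank_cluster` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def bm25_similarity(query, doc, idf, avg_len, k1=1.5, b=0.75):
    """BM25 score of token list `doc` with respect to token list `query`."""
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        f = tf.get(term, 0)
        if f == 0:
            continue
        # Length-normalized term frequency saturation.
        norm = f + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf[term] * f * (k1 + 1) / norm
    return score


def rank_cluster(sentences, damping=0.85, iterations=50):
    """Return sentence indices of one cluster, most central first.

    Assumption: edges are weighted by BM25 and scored with a
    TextRank-style weighted PageRank; the paper may differ in detail.
    """
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    if n == 0:
        return []
    avg_len = max(sum(len(d) for d in docs) / n, 1e-9)
    df = Counter(t for d in docs for t in set(d))
    # Non-negative IDF variant, so edge weights are never negative.
    idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    # weights[i][j]: how strongly sentence i "votes" for sentence j.
    weights = [
        [bm25_similarity(docs[i], docs[j], idf, avg_len) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[i] * weights[i][j] / out_sum[i]
                for i in range(n)
                if out_sum[i] > 0
            )
            for j in range(n)
        ]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


cluster = [
    "the new phone battery lasts long",
    "battery life of the new phone is great",
    "the phone battery is great",
    "i had pasta for lunch",
]
ranking = rank_cluster(cluster)
top_sentence = cluster[ranking[0]]
```

In this sketch, `top_sentence` stands in for the top-ranked sentence that the third phase would pass to the neural generation model. A sentence that shares no terms with the rest of its cluster receives only the baseline score and is ranked last.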
