LSIF: A System for Large-Scale Information Flow Detection Based on Topic-Related Semantic Similarity Measurement

Information flow detection tracks the dynamics and evolution of information as it spreads across the Web over time. The main challenges are choosing a suitable information granularity for detection and tracking how one piece of information evolves into another. In addition, doing this efficiently on large-scale data remains an open technical problem. In this paper, we propose a system-level approach (LSIF) for large-scale, topic-related semantic information flow detection. We treat the sentence as the basic information unit. We represent each word or sentence as a continuous high-dimensional vector, built with word embeddings and the Fisher kernel, and use these vectors to measure semantic similarity. To handle large-scale information efficiently, we propose a dimension reduction framework called Random Reference Reduction (3R). Furthermore, we adopt a novel clustering algorithm to extract memes (a meme is a piece of information together with its variants) and to analyze how memes evolve. We demonstrate the effectiveness of our approach on two terabyte-level datasets. The first is a dataset used in previous research, on which we conducted a series of experiments to evaluate performance. The results show that our approach is more effective and more efficient than the state-of-the-art methods. The second is a 5-terabyte dataset crawled from 20 Chinese news sites. From this Chinese dataset we extract 9 million memes, which took about two days, and we visualize the detected information flow.
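
As a rough illustration of vector-based semantic similarity between sentences, the sketch below averages word vectors and compares sentences by cosine similarity. This is a simplified stand-in and not the Fisher kernel representation or the 3R reduction used by LSIF. The embedding values and the names TOY_EMBEDDINGS, sentence_vector, and cosine_similarity are hypothetical and chosen only for this example.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, for illustration only;
# a real system would load embeddings trained on a large corpus).
TOY_EMBEDDINGS = {
    "government": [0.9, 0.1, 0.0],
    "state": [0.8, 0.2, 0.1],
    "announces": [0.1, 0.9, 0.0],
    "declares": [0.2, 0.8, 0.1],
    "policy": [0.7, 0.3, 0.2],
    "football": [0.0, 0.1, 0.9],
}


def sentence_vector(sentence, embeddings):
    """Average the vectors of the known words in a whitespace-tokenized sentence.

    Assumption: plain averaging replaces the Fisher kernel aggregation
    named in the abstract.
    """
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("sentence contains no known words")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    base = sentence_vector("government announces policy", TOY_EMBEDDINGS)
    variant = sentence_vector("state declares policy", TOY_EMBEDDINGS)
    unrelated = sentence_vector("football", TOY_EMBEDDINGS)
    print("variant:  ", cosine_similarity(base, variant))
    print("unrelated:", cosine_similarity(base, unrelated))
```

With these toy vectors, the paraphrased sentence scores much closer to the original than the unrelated one, even though the two sentences share only one word. That is the property a semantic similarity measure needs in order to group a piece of information with its variants.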