Word statistics in Blogs and RSS feeds: Towards empirical universal evidence

We study the statistics of word occurrences in Blogs and of the waiting times between such occurrences. Because word frequencies are heterogeneous, the empirical analysis is performed on classes of “frequently-equivalent” words, i.e. words grouped according to their frequencies. Two limiting cases are considered: the dilute limit, for words that are used less than once a day, and the dense limit, for frequent words. In both cases, extreme events occur more frequently than expected under the Poisson hypothesis. These deviations from Poisson statistics reveal non-trivial time correlations between events, which are associated with bursts of activity. The distribution of waiting times is shown to behave like a stretched exponential and to have the same shape for different sets of words sharing a common frequency, thereby revealing universal features.
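To make the comparison with the Poisson hypothesis concrete, the following minimal sketch contrasts the tail of an exponential waiting-time distribution (what a Poisson process would give) with a stretched exponential of the same mean. The survival form exp(-(t/tau)^beta), the values `beta = 0.5` and `tau = 1.0`, and all function names are illustrative assumptions, not the parameters or the procedure used in the paper; the synthetic waiting times are drawn from a Weibull distribution, whose survival function has this stretched-exponential form.

```python
import math
import random


def exponential_survival(t, mean):
    """P(T > t) for a Poisson process with the given mean waiting time."""
    return math.exp(-t / mean)


def stretched_exponential_survival(t, tau, beta):
    """P(T > t) = exp(-(t / tau) ** beta); beta < 1 gives a heavier tail."""
    return math.exp(-((t / tau) ** beta))


def stretched_mean(tau, beta):
    """Mean waiting time of the stretched-exponential (Weibull) law."""
    return tau * math.gamma(1.0 + 1.0 / beta)


def sample_waiting_times(n, tau, beta, seed=0):
    """Draw n synthetic waiting times (hypothetical data, not Blog data)."""
    rng = random.Random(seed)
    return [rng.weibullvariate(tau, beta) for _ in range(n)]


def empirical_survival(samples, t):
    """Fraction of observed waiting times longer than t."""
    return sum(1 for s in samples if s > t) / len(samples)


# Assumed, illustrative parameters (not fitted values from the paper).
beta = 0.5
tau = 1.0
mean = stretched_mean(tau, beta)

waits = sample_waiting_times(20000, tau, beta)

# An "extreme event": a waiting time ten times longer than the mean.
t_extreme = 10 * mean
p_poisson = exponential_survival(t_extreme, mean)
p_stretched = stretched_exponential_survival(t_extreme, tau, beta)
p_empirical = empirical_survival(waits, t_extreme)

print(f"Poisson tail:   {p_poisson:.2e}")
print(f"Stretched tail: {p_stretched:.2e}")
print(f"Empirical tail: {p_empirical:.2e}")
```

With both distributions constrained to the same mean, the stretched exponential assigns far more probability to very long waiting times than the exponential does, which is the sense in which extreme events are more frequent than the Poisson hypothesis predicts.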
