Hierarchical geographical modeling of user locations from social media posts

With the availability of cheap location sensors, geotagging of messages in online social networks is proliferating. For instance, Twitter, Facebook, Foursquare, and Google+ provide these services either explicitly, by letting users choose their location, or implicitly, via a sensor. This paper presents an integrated generative model of location and message content. That is, we provide a model that combines distributions over locations, over topics, and over user characteristics, the latter in terms of both location and content preferences. Unlike previous work, which modeled data in a flat, pre-defined representation, our model automatically infers the hierarchical structure over content as well as the size and position of geographical locations. This affords significantly higher accuracy: location uncertainty is reduced by 40% relative to the best previous results [21] achieved on location estimation from tweets. We achieve this by proposing a new statistical model, the nested Chinese Restaurant Franchise (nCRF), a hierarchical model of tree distributions. Much statistical structure is shared between users; that said, each user has their own distribution over interests and places. The nCRF allows us to capture the following effects: (1) a topic model for tweets; (2) location-specific topics; (3) a latent distribution of locations; (4) a joint hierarchical model of topics and locations; (5) personalized preferences over topics and locations within the above model. In doing so, we are able both to obtain accurate estimates of a user's location based on their tweets and to obtain a detailed estimate of a geographical language model.
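As its name suggests, the nCRF builds on tree-structured priors in the style of the nested Chinese restaurant process listed in the references. The following is a minimal sketch of that standard prior only, not of the nCRF or of the authors' inference procedure. It draws root-to-leaf paths through a tree that grows on demand, so the tree shape is inferred rather than fixed in advance. The names `Node`, `crp_choose`, `sample_path`, and the concentration parameter `gamma` are illustrative assumptions, not taken from the paper.

```python
import random


class Node:
    """A tree node that tracks how many sampled paths passed through it."""

    def __init__(self):
        self.count = 0
        self.children = []


def crp_choose(counts, gamma, rng):
    """Chinese restaurant process draw.

    Returns an existing index i with probability proportional to counts[i],
    or len(counts) (a new child) with probability proportional to gamma.
    Assumes gamma > 0.
    """
    total = sum(counts) + gamma
    r = rng.random() * total
    acc = 0.0
    for i, c in enumerate(counts):
        acc += c
        if r < acc:
            return i
    return len(counts)


def sample_path(root, depth, gamma, rng):
    """Sample one path of the given depth, creating nodes when needed."""
    node = root
    node.count += 1
    path = []
    for _ in range(depth):
        k = crp_choose([child.count for child in node.children], gamma, rng)
        if k == len(node.children):
            node.children.append(Node())  # open a new branch
        node = node.children[k]
        node.count += 1
        path.append(k)
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    root = Node()
    for _ in range(10):
        print(sample_path(root, depth=3, gamma=1.0, rng=rng))
```

Popular branches attract more paths while new branches remain possible, which is the "rich get richer" behavior that lets such a prior adapt the granularity of a hierarchy to the data. In the setting of the abstract, one would additionally attach a topic and a location distribution to each node and share the tree across users; that part is deliberately left out here.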

[1]  Shravan M. Narayanamurthy,et al.  Distributed large-scale natural graph factorization , 2013, WWW.

[2]  Alexander J. Smola,et al.  FastEx: Hash Clustering with Exponential Families , 2012, NIPS.

[3]  Jason Baldridge,et al.  Supervised Text-based Geolocation Using Language Models on an Adaptive Grid , 2012, EMNLP.

[4]  Alexander J. Smola,et al.  Discovering geographical topics in the twitter stream , 2012, WWW.

[5]  Alexander J. Smola,et al.  Scalable inference in latent variable models , 2012, WSDM '12.

[6]  Jure Leskovec,et al.  Friendship and mobility: user movement in location-based social networks , 2011, KDD.

[7]  Kyumin Lee,et al.  Exploring Millions of Footprints in Location Sharing Services , 2011, ICWSM.

[8]  E. Xing,et al.  Sparse Additive Generative Models of Text , 2011, ICML.

[9]  Jason Baldridge,et al.  Simple supervised document geolocation with geodesic grids , 2011, ACL.

[10]  E. Xing,et al.  Online Inference for the Infinite Topic-Cluster Model: Storylines from Streaming Text , 2011, AISTATS.

[11]  Alexander J. Smola,et al.  Unified analysis of streaming news , 2011, WWW.

[12]  Jiawei Han,et al.  Geographical topic discovery and comparison , 2011, WWW.

[13]  Michael I. Jordan,et al.  Tree-Structured Stick Breaking for Hierarchical Data , 2010, NIPS.

[14]  Brendan T. O'Connor,et al.  A Latent Variable Model for Geographic Lexical Variation , 2010, EMNLP.

[15]  Lancelot F. James.  Coag-Frag duality for a class of stable Poisson-Kingman mixtures , 2010, 1008.2420.

[16]  E. Xing,et al.  Timeline: A Dynamic Hierarchical Dirichlet Process Model for Recovering Birth/Death and Evolution of Topics in Text Stream , 2010, UAI.

[17]  Jiang-Ming Yang,et al.  Equip tourists with knowledge mined from travelogues , 2010, WWW '10.

[18]  Sergej Sizov,et al.  GeoFolk: latent spatial semantics in web 2.0 social media , 2010, WSDM '10.

[19]  Chong Wang,et al.  Mining geographic knowledge using location aware topic model , 2007, GIR '07.

[20]  A. McCallum,et al.  Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[21]  Michael I. Jordan,et al.  The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies , 2007, JACM.

[22]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006.

[23]  Chao Liu,et al.  A probabilistic approach to spatiotemporal theme pattern mining on weblogs , 2006, WWW '06.

[24]  K. C. Chou,et al.  Multiscale recursive estimation, data fusion, and regularization , 1994, IEEE Trans. Autom. Control.

[25]  C. Antoniak.  Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems , 1974.

[26]  T. Ferguson.  A Bayesian Analysis of Some Nonparametric Problems , 1973.

[27]  D. Blackwell,et al.  Ferguson Distributions Via Polya Urn Schemes , 1973.

[28]  Philip J. Cowans.  Probabilistic Document Modelling , 2006.

[29]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[30]  Michael I. Jordan,et al.  MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES , 1996 .