Hierarchical Discriminative Classification for Text-Based Geolocation

Text-based document geolocation is commonly rooted in language-based information retrieval techniques over geodesic grids. These methods ignore the natural hierarchy of cells in such grids and fall afoul of independence assumptions. We demonstrate the effectiveness of using logistic regression models on a hierarchy of nodes in the grid, which improves upon state-of-the-art accuracy by several percent and reduces mean error distances by hundreds of kilometers on data from Twitter, Wikipedia, and Flickr. We also show that logistic regression performs feature selection effectively, assigning high weights to geocentric terms.
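The idea of classifying over a hierarchy of grid cells can be illustrated with a minimal sketch. It is not the authors' implementation: the two-level uniform grid (10-degree cells split into 5-degree cells), the plain SGD training loop, the toy documents, and every name below (`cell_of`, `LogReg`, `train_hierarchy`, `predict_cell`) are assumptions made for illustration only. A root logistic regression model picks a coarse cell, then a separate model trained only on that cell's documents picks one of its child cells.

```python
import math

COARSE_DEG, FINE_DEG = 10.0, 5.0  # assumed cell sizes; each coarse cell holds 4 fine cells


def cell_of(lat, lon, degrees):
    """Map a coordinate to the (row, col) index of a uniform grid cell."""
    return (math.floor(lat / degrees), math.floor(lon / degrees))


class LogReg:
    """Tiny multinomial logistic regression over bag-of-words tokens (SGD)."""

    def __init__(self, labels):
        self.labels = sorted(set(labels))
        self.w = {lab: {} for lab in self.labels}  # label -> token -> weight

    def probs(self, tokens):
        scores = {lab: sum(self.w[lab].get(t, 0.0) for t in tokens)
                  for lab in self.labels}
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {lab: math.exp(s - top) for lab, s in scores.items()}
        z = sum(exps.values())
        return {lab: e / z for lab, e in exps.items()}

    def fit(self, docs, labels, epochs=50, lr=0.5):
        for _ in range(epochs):
            for tokens, gold in zip(docs, labels):
                p = self.probs(tokens)
                for lab in self.labels:
                    # Gradient of the log-likelihood w.r.t. each active feature.
                    grad = (1.0 if lab == gold else 0.0) - p[lab]
                    for t in tokens:
                        self.w[lab][t] = self.w[lab].get(t, 0.0) + lr * grad
        return self

    def predict(self, tokens):
        p = self.probs(tokens)
        return max(self.labels, key=lambda lab: p[lab])


def train_hierarchy(examples):
    """Train one root model over coarse cells and one child model per coarse cell."""
    docs = [tokens for tokens, _, _ in examples]
    coarse = [cell_of(lat, lon, COARSE_DEG) for _, lat, lon in examples]
    fine = [cell_of(lat, lon, FINE_DEG) for _, lat, lon in examples]
    root = LogReg(coarse).fit(docs, coarse)
    children = {}
    for parent in set(coarse):
        idx = [i for i, c in enumerate(coarse) if c == parent]
        sub_labels = [fine[i] for i in idx]
        children[parent] = LogReg(sub_labels).fit([docs[i] for i in idx], sub_labels)
    return root, children


def predict_cell(model, tokens):
    """Greedy top-down decoding: choose a coarse cell, then a fine cell inside it."""
    root, children = model
    parent = root.predict(tokens)
    return parent, children[parent].predict(tokens)


# Toy training set (invented for this sketch): (tokens, latitude, longitude).
EXAMPLES = [
    (["austin", "tacos", "music"], 30.27, -97.74),
    (["austin", "bbq"], 30.27, -97.74),
    (["oklahoma", "thunder", "bbq"], 35.47, -97.52),
    (["oklahoma", "tornado"], 35.47, -97.52),
    (["london", "thames", "rain"], 51.51, -0.13),
    (["london", "tube"], 51.51, -0.13),
    (["edinburgh", "castle", "rain"], 55.95, -3.19),
    (["edinburgh", "festival"], 55.95, -3.19),
]

model = train_hierarchy(EXAMPLES)
print(predict_cell(model, ["tacos", "austin", "tonight"]))
print(predict_cell(model, ["edinburgh", "castle"]))
```

Each child model only has to separate the few cells under its parent, so the per-node problems stay small, and place-specific tokens such as city names end up with the largest weights in this toy setting. A fuller system would presumably use more levels, a stronger optimizer, and regularization, none of which is shown here.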
