Crowdsourcing the character of a place: Character‐level convolutional networks for multilingual geographic text classification

This article presents a new character-level convolutional neural network model that can classify multilingual text written in any character set that can be encoded with UTF-8, a standard and widely used variable-width character encoding built on 8-bit code units. For geographic classification of text, we demonstrate that this approach is competitive with state-of-the-art word-based text classification methods. The model was tested on four crowdsourced data sets made up of Wikipedia articles, online travel blogs, Geonames toponyms, and Twitter posts. Unlike word-based methods, which require data cleaning and pre-processing, the proposed model works for any language without modification and achieves classification accuracy comparable to existing methods. Using a synthetic data set with introduced character-level errors, we show that it is more robust to noise than word-level classification algorithms. The results indicate that UTF-8 character-level convolutional neural networks are a promising technique for georeferencing noisy text, such as that found in colloquial social media posts and texts scanned with optical character recognition. However, word-based methods currently require less computation time to train, so they remain preferable for classifying well-formatted and cleaned texts in single languages.

Keywords: crowdsourcing, convolutional neural networks, text classification, geoparsing, geographic information retrieval, user-generated content
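To make the idea of UTF-8 character-level input concrete, the following is a minimal sketch of how multilingual text might be turned into a fixed-length sequence of byte ids, and how character-level errors might be introduced for a robustness test. The function names, the padding scheme, the `max_len` parameter, and the random-substitution noise are all illustrative assumptions; the abstract does not specify the model's actual input representation or noise procedure.

```python
import random


def utf8_byte_ids(text: str, max_len: int = 32) -> list[int]:
    """Encode text as UTF-8 and return a fixed-length list of byte ids.

    Assumption for illustration: byte values 0-255 are shifted up by 1 so
    that id 0 can serve as padding. Truncation is done on bytes, so a
    multi-byte character at the cut-off may be split.
    """
    raw = text.encode("utf-8")[:max_len]
    ids = [b + 1 for b in raw]
    return ids + [0] * (max_len - len(ids))


def add_char_noise(text: str, rate: float, rng: random.Random) -> str:
    """Replace each character, with probability `rate`, by a random
    lowercase ASCII letter (a hypothetical stand-in for typing or OCR errors)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(letters) if rng.random() < rate else ch for ch in text
    )


if __name__ == "__main__":
    rng = random.Random(0)
    for sample in ["Wāhi", "東京", "Auckland"]:
        noisy = add_char_noise(sample, rate=0.2, rng=rng)
        print(sample, utf8_byte_ids(sample, max_len=12))
        print(noisy, utf8_byte_ids(noisy, max_len=12))
```

Because every script is reduced to the same 256 byte values, a sequence like this needs no language-specific tokenizer or vocabulary, which is the property the abstract relies on when it says the model works for any language without modification.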

Benjamin Adams | Grant McKenzie
