Combining Deep Learning and String Kernels for the Localization of Swiss German Tweets

In this work, we describe the methods proposed by the UnibucKernel team to solve the Social Media Variety Geolocation task featured in the 2020 VarDial Evaluation Campaign. We address only the second subtask, which targets a data set composed of nearly 30 thousand Swiss German Jodels. This dialect identification task requires accurately predicting the latitude and longitude of each test sample. We frame the task as a double regression problem and employ a variety of machine learning approaches to predict both latitude and longitude. We approach the problem from several perspectives in an attempt to minimize the prediction error: simple regression models, such as Support Vector Regression; deep neural networks, such as Long Short-Term Memory networks and character-level convolutional neural networks; and, finally, ensemble models based on meta-learners, such as XGBoost. With the same goal in mind, we also consider several types of features, from high-level features, such as BERT embeddings, to low-level features, such as character n-grams, which are known to provide good results in dialect identification. Our empirical results indicate that the handcrafted model based on string kernels outperforms the deep learning approaches. Nevertheless, our best performance is given by the ensemble model that combines both handcrafted and deep learning models.
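
To make the double regression framing concrete, the following Python sketch predicts latitude and longitude from character n-gram similarity. It is only an illustration under stated assumptions and not the system described above: the function names, the n-gram range of 3 to 5, and the choice of a presence-bits string kernel are hypothetical, and a simple kernel-weighted average of training coordinates stands in for Support Vector Regression. The two training samples are invented toy values.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams of length n_min..n_max (range is an assumption)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def presence_kernel(a, b):
    """Presence-bits string kernel: number of distinct n-grams shared by a and b."""
    return len(a.keys() & b.keys())


def normalized_kernel(a, b):
    """Scale the kernel so that identical strings have similarity 1."""
    denom = sqrt(presence_kernel(a, a) * presence_kernel(b, b))
    return presence_kernel(a, b) / denom if denom else 0.0


def predict_location(train, query):
    """Double regression: estimate latitude and longitude as two separate targets.

    `train` is a list of (text, latitude, longitude) tuples. Each coordinate is
    a kernel-weighted average of the training coordinates. This is a stand-in
    for a kernel regressor such as SVR, not an SVR implementation.
    """
    q = char_ngrams(query)
    weights = [normalized_kernel(char_ngrams(text), q) for text, _, _ in train]
    total = sum(weights)
    if total == 0:
        # No shared n-grams: fall back to the mean training location.
        n = len(train)
        return (sum(lat for _, lat, _ in train) / n,
                sum(lon for _, _, lon in train) / n)
    lat = sum(w * la for w, (_, la, _) in zip(weights, train)) / total
    lon = sum(w * lo for w, (_, _, lo) in zip(weights, train)) / total
    return lat, lon


if __name__ == "__main__":
    # Invented toy samples with approximate coordinates, for illustration only.
    toy_train = [
        ("grüezi mitenand", 47.37, 8.54),
        ("hoi zäme", 46.95, 7.45),
    ]
    print(predict_location(toy_train, "grüezi"))
```

In a fuller pipeline, the same normalized kernel matrix could be passed to a kernel regressor trained once for latitude and once for longitude, and those predictions could in turn feed a meta-learner alongside the outputs of neural models.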
