From Shakespeare to Twitter: What are Language Styles all about?

As natural language processing research grows, driven largely by the availability of data, we have expanded research from news and small-scale dialog corpora to the web and social media. User-generated data and crowdsourcing have opened the door to investigating human language of various styles with more statistical power and real-world applications. In this position/survey paper, I will review and discuss seven language styles that I believe to be important and interesting to study: influential work in the past, challenges at present, and potential impact for the future.

1 Top Three Problems

The top three problems for studying language styles are data, data and data. More specifically, they are the data shortage, data fusion, and data annotation problems. The data shortage problem has been easing, which is the main reason for the surge in the number of research studies on language styles. The data fusion problem is more specific to this area, owing to the subtle and often subjective nature of linguistic styles. For instance, while men and women talk in different ways (note that this is not the same as talking about different things), they also talk about a lot of things in an indistinguishable way. Moreover, there is huge variance from one man to another and from one woman to another. The styles are often fused together in the data and are not easy to separate out or to make black-and-white judgements on. This also leads to challenges in data annotation and data collection, compared to other NLP tasks (e.g., question answering). Throughout the rest of this paper, we shall see many creative solutions, interesting work, and promising potential.

2 Seven Styles of Language

Disclaimers: (i) We discuss these styles primarily in the context of natural language processing research; (ii) There are certainly more than seven language styles, as there are more than seven wonders in the world.

2.1 Simple and Short

Text simplification is one of the earliest topics in computational linguistics that directly deals with language styles: rewriting regular texts into simpler versions for people with limited reading capabilities. The major transition from rule-based to machine learning approaches for automatic sentence simplification did not happen until 2010, after Simple English Wikipedia became available. It is worth noting that the Simple Wikipedia data has some issues with quality and degree of simplicity (Xu et al., 2015b). The shortage of high-quality data is gradually being alleviated as the Newsela corpus (Xu et al., 2015b) of 1000+ professionally edited articles has been released, and as the research community gives more attention and appreciation to data construction (Brunato et al., 2016; Hwang et al., 2015). Multiple studies have shown that crowdsourcing workers can produce high-quality simplifications (Xu et al., 2016; Amancio and Specia, 2014; Pellow and Eskenazi, 2014), though it is costly to scale up. Data will remain a central problem, as data-hungry neural generation models (Nisioi et al., 2017) are a promising direction for future work.

Besides data, another severe problem is evaluation. In fact, one common human evaluation that uses a five-point Likert scale on grammaticality, meaning and simplicity should be considered [...]

Lexical simplification as a subtask can either use parallel data or bypass the need for it (Glavaš and Štajner, 2015; Paetzold and Specia, 2016; Pavlick and Callison-Burch, 2016).
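
To make the idea of lexical simplification without parallel data concrete, here is a minimal toy sketch. It is not the method of any of the cited systems. It assumes a hypothetical synonym table (SYNONYMS) and hypothetical word frequencies (FREQUENCY), and it treats higher frequency as a rough proxy for simplicity, which is itself an assumption. No complex-simple sentence pairs are needed.

```python
import re

# Hypothetical toy resources for illustration only; a real system would
# derive candidates and frequencies from large monolingual corpora.
SYNONYMS = {
    "utilize": ["use", "employ"],
    "commence": ["begin", "start"],
    "purchase": ["buy", "acquire"],
}

# Hypothetical relative frequencies: higher is assumed to mean simpler.
FREQUENCY = {
    "use": 900, "employ": 120, "utilize": 40,
    "begin": 500, "start": 800, "commence": 15,
    "buy": 700, "acquire": 90, "purchase": 200,
}


def simplify_word(word):
    """Return the most frequent candidate among the word and its synonyms."""
    lower = word.lower()
    candidates = [lower] + SYNONYMS.get(lower, [])
    best = max(candidates, key=lambda w: FREQUENCY.get(w, 0))
    if best == lower:
        return word  # nothing simpler is known, keep the original token
    # Preserve sentence-initial capitalization of the replaced word.
    return best.capitalize() if word[0].isupper() else best


def simplify_sentence(sentence):
    """Replace each alphabetic token with its simplest known substitute."""
    return re.sub(r"[A-Za-z]+", lambda m: simplify_word(m.group(0)), sentence)


if __name__ == "__main__":
    print(simplify_sentence("We utilize the tool and commence work."))
```

The sketch ignores inflection, word sense, and context, which are exactly the places where real lexical simplification systems spend most of their effort.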

[1] M. L. Chappelle. The language of food, 1972, The American journal of nursing.

[2] M. L. Chappelle. The Language of Food, 1972, The American journal of nursing.

[3] Luke S. Zettlemoyer, et al. Online Learning of Relaxed CCG Grammars for Parsing to Logical Form, 2007, EMNLP.

[4] Jonathan H. Clark, et al. A Classifier System for Author Recognition Using Synonym-Based Features, 2007, MICAI.

[5] Luke S. Zettlemoyer, et al. Reinforcement Learning for Mapping Instructions to Actions, 2009, ACL.

[6] Swapna Somasundaran, et al. Recognizing Stances in Online Debates, 2009, ACL.

[7] Brendan T. O'Connor, et al. A Latent Variable Model for Geographic Lexical Variation, 2010, EMNLP.

[8] Oren Etzioni, et al. Named Entity Recognition in Tweets: An Experimental Study, 2011, EMNLP.

[9] Raymond J. Mooney, et al. Learning to Interpret Natural Language Navigation Instructions from Observations, 2011, Proceedings of the AAAI Conference on Artificial Intelligence.

[10] Timothy Baldwin, et al. Lexical Normalisation of Short Text Messages: Makn Sens a #twitter, 2011, ACL.

[11] Brendan T. O'Connor, et al. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments, 2010, ACL.

[12] Chris Callison-Burch, et al. Evaluating Sentence Compression: Pitfalls and Suggested Remedies, 2011, Monolingual@ACL.

[13] Marco Guerini, et al. Do Linguistic Style and Readability of Scientific Abstracts Affect their Virality?, 2012, ICWSM.

[14] David Bamman, et al. Gender identity and lexical variation in social media, 2012, arXiv:1210.4567.

[15] Ralph Grishman, et al. Paraphrasing for Style, 2012, COLING.

[16] Daniel Jurafsky, et al. He Said, She Said: Gender in the ACL Anthology, 2012, Discoveries@ACL.

[17] Stefanie Tellex, et al. Interpreting and Executing Recipes with a Cooking Robot, 2012, ISER.

[18] Jun-Ming Xu, et al. Learning from Bullying Traces in Social Media, 2012, NAACL.

[19] Rachel Greenstadt, et al. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity, 2012, TSEC.

[20] Daniel Jurafsky, et al. Linguistic Models for Analyzing and Detecting Biased Language, 2013, ACL.

[21] Jure Leskovec, et al. A computational approach to politeness with application to social factors, 2013, ACL.

[22] Subbarao Kambhampati, et al. Dude, srsly?: The Surprisingly Formal Nature of Twitter's Language, 2013, ICWSM.

[23] Wei Xu, et al. Gathering and Generating Paraphrases from Twitter with Application to Normalization, 2013, BUCC@ACL.

[24] Luke S. Zettlemoyer, et al. Automatic Idiom Identification in Wiktionary, 2013, EMNLP.

[25] Dan Klein, et al. Unsupervised Transcription of Historical Documents, 2013, ACL.

[26] Alexander Yates, et al. Large-scale Semantic Parsing via Schema Matching and Lexicon Extension, 2013, ACL.

[27] Margaret L. Kern, et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach, 2013, PloS one.

[28] David Yarowsky, et al. Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media, 2013, EMNLP.

[29] Jacob Eisenstein, et al. What to do about bad language on the internet, 2013, NAACL.

[30] Luke S. Zettlemoyer, et al. Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions, 2013, TACL.

[31] Li Wang, et al. How Noisy Social Media Text, How Diffrnt Social Media Sources?, 2013, IJCNLP.

[32] Advaith Siddharthan, et al. Hybrid text simplification using synchronous dependency grammars with hand-written and automatically harvested rules, 2014, EACL.

[33] Maarten Sap, et al. Developing Age and Gender Predictive Lexica over Social Media, 2014, EMNLP.

[34] Maxine Eskénazi, et al. An Open Corpus of Everyday Documents for Simplification Tasks, 2014, PITR@EACL.

[35] Chris Callison-Burch, et al. Extracting Lexically Divergent Paraphrases from Twitter, 2014, TACL.

[36] Dan Klein, et al. Improved Typesetting Models for Historical OCR, 2014, ACL.

[37] Noah A. Smith, et al. A Dependency Parser for Tweets, 2014, EMNLP.

[38] Lucia Specia, et al. An Analysis of Crowdsourced Text Simplifications, 2014, PITR@EACL.

[39] Wei Xu, et al. Data-driven Approaches for Paraphrasing across Language Variations, 2014.

[40] Hongyu Guo, et al. The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition, 2015, NAACL.

[41] Anoop Sarkar, et al. Improving Statistical Machine Translation with a Multilingual Paraphrase Database, 2015, EMNLP.

[42] David Bamman, et al. Contextualized Sarcasm Detection on Twitter, 2015, ICWSM.

[43] Yi Yang, et al. Unsupervised Multi-Domain Adaptation with Feature Embeddings, 2015, NAACL.

[44] Goran Glavaš, et al. Simplifying Lexical Simplification: Do We Need Simplified Corpora?, 2015, ACL.

[45] Chris Callison-Burch, et al. Problems in Current Text Simplification Research: New Data Can Help, 2015, TACL.

[46] Nizar Habash, et al. Predicting the Structure of Cooking Recipes, 2015, EMNLP.

[47] Noah A. Smith, et al. The Utility of Text: The Case of Amicus Briefs and the Supreme Court, 2014, AAAI.

[48] Marine Carpuat. Connotation in Translation, 2015, WASSA@EMNLP.

[49] Jason Weston, et al. A Neural Attention Model for Abstractive Sentence Summarization, 2015, EMNLP.

[50] Dan Klein, et al. Unsupervised Code-Switching for Multilingual Historical Document Transcription, 2015, NAACL.

[51] Wang Ling, et al. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation, 2015, EMNLP.

[52] Christof Monz, et al. Five Shades of Noise: Analyzing Machine Translation Errors in User-Generated Text, 2015, NUT@IJCNLP.

[53] Tomoki Toda, et al. Linguistic Individuality Transformation for Spoken Language, 2015, Natural Language Dialog Systems and Intelligent Assistants.

[54] Noah A. Smith, et al. The Media Frames Corpus: Annotations of Frames Across Issues, 2015, ACL.

[55] Gözde Özbal, et al. Echoes of Persuasion: The Effect of Euphony in Persuasive Communication, 2015, NAACL.

[56] Dirk Hovy, et al. Challenges of studying and processing dialects in social media, 2015, NUT@IJCNLP.

[57] Dirk Hovy, et al. Cross-lingual syntactic variation over age and gender, 2015, CoNLL.

[58] Noah A. Smith, et al. A Corpus and Model Integrating Multiword Expressions and Supersenses, 2015, NAACL.

[59] Lukasz Kaiser, et al. Sentence Compression by Deletion with LSTMs, 2015, EMNLP.

[60] Wei Wu, et al. Aligning Sentences from Standard Wikipedia to Simple Wikipedia, 2015, NAACL.

[61] Chris Callison-Burch, et al. SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT), 2015, *SEMEVAL.

[62] Dirk Hovy, et al. Personality Traits on Twitter—or—How to Get 1,500 Personality Tests in a Week, 2015, WASSA@EMNLP.

[63] Yejin Choi, et al. Mise en Place: Unsupervised Interpretation of Instructional Recipes, 2015, EMNLP.

[64] Michael Vitale, et al. The Wisdom of Crowds, 2015, Cell.

[65] Yejin Choi, et al. Connotation Frames: A Data-Driven Investigation, 2015, ACL.

[66] Lyle H. Ungar, et al. Discovering User Attribute Stylistic Differences via Paraphrasing, 2016, AAAI.

[67] Jianfeng Gao, et al. A Persona-Based Neural Conversation Model, 2016, ACL.

[68] Rada Mihalcea, et al. Identifying Cross-Cultural Differences in Word Usage, 2016, COLING.

[69] Alan Ritter, et al. TweeTime: A Minimally Supervised Method for Recognizing and Normalizing Time Expressions in Twitter, 2016, EMNLP.

[70] Rada Mihalcea, et al. Finding Optimists and Pessimists on Twitter, 2016, ACL.

[71] Luke S. Zettlemoyer, et al. Global Neural CCG Parsing with Optimality Guarantees, 2016, EMNLP.

[72] Barbara Plank, et al. What to do about non-standard (or non-canonical) language in NLP, 2016, KONVENS.

[73] Dan Garrette, et al. An Unsupervised Model of Orthographic Variation for Historical Document Transcription, 2016, NAACL.

[74] Chris Callison-Burch, et al. Optimizing Statistical Machine Translation for Text Simplification, 2016, TACL.

[75] Kristina Toutanova, et al. A Dataset and Evaluation Metrics for Abstractive Compression of Sentences and Short Paragraphs, 2016, EMNLP.

[76] Chris Callison-Burch, et al. Simple PPDB: A Paraphrase Database for Simplification, 2016, ACL.

[77] Yoav Artzi, et al. Neural Shift-Reduce CCG Semantic Parsing, 2016, EMNLP.

[78] Timothy Baldwin, et al. Twitter Geolocation Prediction Shared Task of the 2016 Workshop on Noisy User-generated Text, 2016, NUT@COLING.

[79] Lucia Specia, et al. Benchmarking Lexical Simplification Systems, 2016, LREC.

[80] Felice Dell'Orletta, et al. PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification, 2016, EMNLP.

[81] Cristian Danescu-Niculescu-Mizil, et al. Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions, 2016, WWW.

[82] Walter Daelemans, et al. TwiSty: A Multilingual Twitter Stylometry Corpus for Gender and Personality Profiling, 2016, LREC.

[83] Sampo Pyysalo, et al. Attending to Characters in Neural Sequence Labeling Models, 2016, COLING.

[84] William L. Hamilton, et al. Language from police body camera footage shows racial disparities in officer respect, 2017, Proceedings of the National Academy of Sciences.

[85] Hua He, et al. A Continuously Growing Dataset of Sentential Paraphrases, 2017, EMNLP.

[86] Sergiu Nisioi, et al. Exploring Neural Text Simplification Models, 2017, ACL.

[87] Michael S. Bernstein, et al. Anyone Can Become a Troll: Causes of Trolling Behavior in Online Discussions, 2017, CSCW.