Writing styles in different scientific disciplines: a data science approach

We quantified several elements that reflect the writing styles of scientific papers in four related disciplines: physics, astrophysics, mathematics, and computer science. From each document we extracted text descriptors that are not based on the words used in the paper, such as the use of punctuation characters, upper case letters, and quotations. Based on these features alone, an automatic classifier was able to identify the discipline of a paper with accuracy much higher than mere chance. This shows that disciplines can be differentiated by their writing styles, without directly using their content as reflected by the common words used in the papers. The study showed statistically significant differences between the disciplines in the use of acronyms, sentence length, word length, and other descriptors. Our findings also show changes in writing styles within specific disciplines over time. For instance, mathematicians and computer scientists began to use fewer acronyms starting from 2006, and there is a dramatic decrease in the average number of punctuation characters in mathematics papers. These observations suggest that even closely related disciplines differ in how scientific communication is expressed through writing style, demonstrating the existence of a “signature” writing style developed in each discipline. These findings should also be taken into account when a multidisciplinary group of collaborators assigns writing duties on a joint scientific manuscript.
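As an illustration of what content-independent text descriptors can look like, the following is a minimal Python sketch. It is not the feature set or implementation used in the study: the function name `style_features`, the choice of six descriptors, and the simple rules used here (a sentence ends at `.`, `!`, or `?`; an acronym is an all-upper-case word of two or more letters) are assumptions made for the example.

```python
import re
import string


def style_features(text: str) -> dict:
    """Compute a few word-independent style descriptors for one document.

    Hypothetical sketch: the descriptor names and tokenization rules are
    assumptions for illustration, not the descriptors used in the paper.
    """
    # Alphabetic tokens only; digits and symbols are not counted as words.
    words = re.findall(r"[A-Za-z]+", text)
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    # Guard the denominators so empty input yields zeros, not an error.
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)

    # Assumed acronym rule: all-upper-case word with at least two letters.
    acronyms = [w for w in words if len(w) >= 2 and w.isupper()]

    return {
        "punctuation_per_char": sum(c in string.punctuation for c in text) / n_chars,
        "uppercase_per_char": sum(c.isupper() for c in text) / n_chars,
        "quote_chars": float(sum(c in "\"'“”" for c in text)),
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "words_per_sentence": len(words) / n_sentences,
        "acronym_fraction": len(acronyms) / n_words,
    }


if __name__ == "__main__":
    sample = 'We use the FFT. It is "fast".'
    for name, value in style_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature vectors of this kind, computed per paper, could then be passed to any standard classifier to test whether the discipline can be predicted from style alone.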
