Slavic Corpus and Computational Linguistics

Abstract: In this paper we focus on corpus-linguistic studies that address theoretical questions and on computational-linguistic work on corpus annotation that makes corpora useful for linguistic analysis. First, we discuss why the corpus-linguistic approach was discredited by generative linguists in the second half of the 20th century, how it made a comeback through advances in computing, and how it was finally adopted by usage-based linguistics at the beginning of the 21st century. We then give an overview of necessary and common annotation layers and of the issues encountered in automatic annotation, with special emphasis on Slavic languages. Finally, we survey the types of research requiring corpora that Slavic linguists are involved in worldwide and the resources at their disposal.
