Modelling Morphographemic Alternations in Derivation of Czech

Abstract The present paper deals with morphographemic alternations in Czech derivation in the context of building a large-coverage lexical resource specialized in the derivational morphology of contemporary Czech (the DeriNet database). After a summary of the available descriptions in Czech linguistic literature and in Natural Language Processing, the first part of the paper provides an extensive list of alternations, focusing on their manifestation in writing. Because alternations in Czech derivation are frequent and only partially predictable, several bottom-up methods were used to model them adequately in DeriNet. Suffix-substitution rules proved efficient for alternations at the final position of the stem, whereas a specialized approach that extracts alternations from inflectional paradigms was used to model alternations within roots. Alternations connected with the derivation of verbs were handled as a separate task. The DeriNet data are expected to be helpful in developing a tool for morphemic segmentation and, once such segmentation is available, to become a reliable resource for a data-based description of word formation in Czech, including alternations.
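
To make the suffix-substitution approach concrete, the following is a minimal Python sketch of how stem-final alternations can be modelled as (stem ending, derivational suffix) rewrite rules. The rule inventory, the `derive` function, and the example pairs are illustrative assumptions for exposition only; they are not the actual DeriNet rule set.

```python
# A minimal sketch of suffix-substitution rules for stem-final
# morphographemic alternations in Czech derivation. The rules and
# examples below are illustrative assumptions, not the DeriNet
# inventory. Inputs are bare stems (inflectional endings stripped).

# Each rule maps (stem-final grapheme(s), derivational suffix) to the
# alternated stem ending, e.g. k -> c^ before the diminutive suffix -ek.
ALTERNATION_RULES = {
    ("k", "ek"): "č",   # pták -> ptáček, vlk -> vlček
    ("ch", "ek"): "š",  # hrách -> hrášek
    ("h", "ka"): "ž",   # noh(a) -> nožka
}

def derive(stem: str, suffix: str) -> str:
    """Attach a derivational suffix, applying a stem-final alternation
    if a rule matches; otherwise concatenate unchanged."""
    for (final, sfx), replacement in ALTERNATION_RULES.items():
        if stem.endswith(final) and suffix == sfx:
            return stem[: -len(final)] + replacement + suffix
    return stem + suffix

if __name__ == "__main__":
    print(derive("pták", "ek"))   # ptáček  (k -> č)
    print(derive("hrách", "ek"))  # hrášek  (ch -> š)
    print(derive("noh", "ka"))    # nožka   (h -> ž)
    print(derive("les", "ík"))    # lesík   (no rule fires)
```

Keying each rule on the suffix as well as the stem ending reflects the limited predictability noted above: the same stem-final consonant may or may not alternate depending on which derivational suffix is attached. Alternations deeper inside the root (e.g. vowel changes) are not covered by such rules, which is why the paper treats them with a separate, paradigm-based method.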
