Evaluating Improvised Hip Hop Lyrics - Challenges and Observations

We investigate novel challenges involved in comparing model performance on the task of improvising responses to hip hop lyrics, and discuss observations regarding inter-evaluator agreement on judging improvisation quality. We believe the analysis serves as a first step toward designing robust evaluation strategies for improvisation tasks, a relatively neglected area to date. Unlike most natural language processing tasks, improvisation tasks suffer from a high degree of subjectivity, making it difficult to design discriminative evaluation strategies to drive model development. We propose a simple strategy that uses fluency and rhyming as the criteria for evaluating the quality of generated responses, and we apply it both to our inversion transduction grammar-based FREESTYLE hip hop challenge-response improvisation system and to various contrastive systems. We report inter-evaluator agreement for both English and French hip hop lyrics, and analyze its correlation with challenge length. We also compare the extent of agreement in evaluating fluency with that in evaluating rhyming, and quantify the difference in agreement with and without precise definitions of the evaluation criteria.
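The abstract does not state which agreement statistic is reported. As an illustration only, the sketch below computes Fleiss' kappa, a common measure of agreement among more than two evaluators on categorical judgments. The function name `fleiss_kappa`, the three-way good/acceptable/bad scale, and the example counts are all hypothetical and are not taken from the paper.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is the number of evaluators who assigned response i
    to category j. Every response must be judged by the same number
    of evaluators (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two evaluators per response")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every response needs the same number of judgments")
    n_categories = len(counts[0])

    # Share of all judgments that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each response: fraction of evaluator pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance

    if math.isclose(p_e, 1.0):
        # All judgments fall in one category; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical fluency judgments: 4 responses, 3 evaluators,
    # columns are (good, acceptable, bad). Illustrative data only.
    fluency_counts = [
        [3, 0, 0],
        [1, 2, 0],
        [0, 1, 2],
        [0, 0, 3],
    ]
    print(f"Fleiss' kappa (fluency): {fleiss_kappa(fluency_counts):.3f}")
```

Computing the statistic separately for fluency and for rhyming judgments, or for judgments collected with and without precise criterion definitions, would give the kind of side-by-side comparison the abstract describes.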
