A Unit Selection Methodology for Music Generation Using Deep Neural Networks

Several methods exist for generating music from data with a computer, including Markov chains, recurrent neural networks, recombinancy, and grammars. We explore unit selection and concatenation as a means of generating music using a ranking-based procedure, where we consider a unit to be a variable-length number of measures of music. We first examine whether a unit selection method that is restricted to a finite-size unit library can be sufficient to encompass a wide spectrum of music. We do this by developing a deep autoencoder that encodes a musical input and reconstructs it by selecting from the library. We then describe a generative model that combines a deep structured semantic model (DSSM) with an LSTM to predict the next unit, where units consist of four, two, and one measures of music. We evaluate the generative model using objective metrics, including mean rank and accuracy, and with a subjective listening test in which expert musicians complete a forced-choice ranking task. We compare our model to a note-level generative baseline consisting of a stacked LSTM trained to predict one note ahead.
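The ranking-based evaluation mentioned above can be illustrated with a small sketch. It is not the paper's model: the cosine scoring function, the two-dimensional embeddings, and the names `rank_units` and `mean_rank_and_accuracy` are assumptions made for illustration only. In the described system the scores would come from the DSSM and LSTM rather than from fixed vectors. The sketch ranks every unit in a toy library against a context vector, then reports the mean rank of the true next unit and the fraction of cases in which it is ranked first.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_units(context_vec, library):
    """Return unit ids ordered from most to least similar to the context.

    `library` maps a (hypothetical) unit id to its embedding vector.
    """
    return sorted(
        library,
        key=lambda uid: cosine(context_vec, library[uid]),
        reverse=True,
    )


def mean_rank_and_accuracy(examples, library):
    """Compute mean rank (1 = best) and top-1 accuracy of the true next unit.

    `examples` is a list of (context_vector, true_unit_id) pairs.
    """
    ranks = []
    for context_vec, true_uid in examples:
        ranking = rank_units(context_vec, library)
        ranks.append(ranking.index(true_uid) + 1)
    mean_rank = sum(ranks) / len(ranks)
    accuracy = sum(r == 1 for r in ranks) / len(ranks)
    return mean_rank, accuracy


# Toy data (assumed): three units with hand-picked 2-D embeddings.
library = {"u0": [1.0, 0.0], "u1": [0.0, 1.0], "u2": [1.0, 1.0]}
examples = [([0.9, 0.1], "u0"), ([0.2, 0.8], "u2")]

if __name__ == "__main__":
    print(mean_rank_and_accuracy(examples, library))
```

In the toy data the first true unit is ranked first and the second is ranked second, so the mean rank is 1.5 and the accuracy is 0.5. A lower mean rank means the true unit sits closer to the top of the candidate list.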
