knowledge in an attempt to achieve more meaningful decompositions

This paper presents a method for automatic music transcription applied to audio recordings of a cappella performances with multiple singers. We propose a system for multi-pitch detection and voice assignment that integrates an acoustic and a music language model. The acoustic model performs spectrogram decomposition, extending probabilistic latent component analysis (PLCA) using a six-dimensional dictionary with pre-extracted log-spectral templates. The music language model performs voice separation and assignment using hidden Markov models that apply musicological assumptions. By integrating the two models, the system is able to detect multiple concurrent pitches in polyphonic vocal music and assign each detected pitch to a specific voice type such as soprano, alto, tenor or bass (SATB). We compare our system against multiple baselines, achieving state-of-the-art results for both multi-pitch detection and voice assignment on a dataset of Bach chorales and another of barbershop quartets. We also present an additional evaluation of our system using varied pitch tolerance levels to investigate its performance at 20-cent pitch resolution.
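The pitch tolerance levels mentioned above can be illustrated with a short sketch. It assumes that a detected pitch counts as correct when it lies within a given number of cents of the reference pitch. The function names and the 50-cent default are illustrative assumptions, not details taken from the paper.

```python
import math


def cents_between(f_est_hz: float, f_ref_hz: float) -> float:
    """Signed pitch difference in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_est_hz / f_ref_hz)


def is_correct(f_est_hz: float, f_ref_hz: float, tolerance_cents: float = 50.0) -> bool:
    """Hypothetical correctness criterion: estimate within +/- tolerance of the reference."""
    return abs(cents_between(f_est_hz, f_ref_hz)) <= tolerance_cents


a4 = 440.0
slightly_sharp = a4 * 2 ** (30 / 1200)  # a pitch 30 cents above A4

print(is_correct(slightly_sharp, a4, tolerance_cents=50.0))  # accepted at the looser tolerance
print(is_correct(slightly_sharp, a4, tolerance_cents=20.0))  # rejected at 20 cents
```

Under this assumed criterion, tightening the tolerance from 50 to 20 cents turns the same estimate from a hit into a miss, which is why results at a finer pitch resolution are reported separately.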
