A Stochastic Method of Tracking a Vocal Performer

Automated accompaniment systems are computer systems that "play along" with a solo musician or a group of musicians when given the score for a piece of music. These systems must be able to "listen" to the live musicians by tracking their progress through the score in real time. Accurate tracking of vocal performers (as opposed to instrumentalists) is a particularly challenging case. In this paper, we present a tracking method based upon a statistical model of vocal performances. This technique incorporates both information obtained from real-time signal processing of a performance (such as fundamental pitch) and information describing the performer's movement through the score (namely tempo and elapsed time). We describe how this model is incorporated into a system currently used to accompany vocal performers, and we provide a preliminary evaluation of its ability to estimate the score position of a performer. We conclude with a discussion of future work to enhance the statistical model and possibly improve vocal tracking.

2 A Stochastic Model of Location

As previously mentioned, a musical score will often consist of one or more solo parts and an accompaniment. In the case of Western classical music written for a single vocalist, the solo part will consist of a sequence of notes, each note indicating at least a pitch, a syllable to be sung, and a relative duration. Other information, such as dynamics and articulation, may also be specified. The tempo for a given piece will likely vary within a single performance, as well as across performances. Tempo variations may be explicitly written in the score by the composer, or may result from conscious choices made by the performer.

The model we use to track a vocalist represents the vocalist's part as a sequence of events that have a fixed, or at least a desired, ordering. Each event is specified by:

1. A relative length, which defines the size or duration of the event, as indicated in the score, relative to other events in the score.

2. An observation distribution, which completely specifies the probability of every possible sensor output at any time during the event.

The relative length may be specified in beats for a fixed tempo, or in some unit of time resulting from the conversion of beats to "idealized time" using a fixed, idealized tempo. The length is assumed to be real-valued and not necessarily a positive integer. The vocalist's part in the score is thus viewed as a sequence of events, each event spanning a region of a number line. The score position of a singer is represented as a real number between 0 and the sum of the lengths of all events in the score. Score position is thus specified in either idealized beats or idealized time, and can indicate the performer's location at a granularity finer than an event.

At any point while tracking an actual performance, the position of the vocalist is represented stochastically as a continuous density function over score position, referred to as the score position density. The area under this function between two score positions indicates the probability that the performer is within that region of the score. This is depicted in Figure 1. The area over the entire length of the score is always 1, indicating it is 100% likely that the performer is somewhere in the score. As the performance progresses and subsequent observations (detected features) are reported, the score position density is updated to yield a probability distribution describing the performer's new location.

The observation distribution for each event specifies the probability of observing any possible value of a detected feature while the vocalist is performing that event. This distribution will generally be conditioned on information provided in the score.
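The number-line view of the score described above can be sketched in code: events with relative lengths are laid end to end, and the area under a discretized score position density over a region approximates the probability that the performer is within that region. This is a minimal illustration under assumed names (`Event`, `event_boundaries`, `probability_in_region` are hypothetical, not the system's actual data structures), using a simple rectangle rule on a uniform grid:

```python
from dataclasses import dataclass

@dataclass
class Event:
    length: float       # relative length in idealized beats or idealized time
    scored_pitch: int   # pitch written in the score (e.g. a MIDI note number)

def event_boundaries(events):
    """Cumulative (start, end) positions of each event on the number line."""
    bounds, pos = [], 0.0
    for e in events:
        bounds.append((pos, pos + e.length))
        pos += e.length
    return bounds

def probability_in_region(grid, density, lo, hi):
    """Approximate the area under the score position density between score
    positions lo and hi (rectangle rule; assumes a uniform sample grid)."""
    dx = grid[1] - grid[0]
    return sum(d * dx for x, d in zip(grid, density) if lo <= x < hi)
```

For example, with a uniform density over a score of total length 4, the probability mass over the first half of the score evaluates to 0.5, and over the whole score to 1, matching the requirement that the total area always be 1.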
For example, if pitch detection is applied to the performance, then the observation distribution for a given event might specify, for each pitch, the likelihood that the detector will report that pitch, conditioned on the pitch written in the score for that event. As another example, distributions might also describe the likelihood of detectable spectral features that are correlated with sung phonemes.

Our approach to tracking the performer is conceptually simple. For each new observation, we use the current score position density and the observation distributions to estimate a new score position density. This updated density indicates the current location of the performer in the score. In practice, calculating a new score position density requires a number of simplifications, assumptions, and approximations. To describe our system, we will first present a mathematical model for updating the score position density, and then describe an implementation of this model.
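The update step just outlined can be illustrated with a small Bayesian sketch: the prior score position density is weighted pointwise by the observation distribution of the event sounding at each position, then renormalized so its total area is again 1. This is a hedged illustration only, ignoring for simplicity the performer's movement between observations (which the full model accounts for); the names `update_density`, `event_for_position`, and `obs_dist` are assumptions, not the authors' implementation:

```python
def update_density(grid, prior, event_for_position, obs_dist, observed):
    """One update of a discretized score position density.

    grid               -- uniform sample points of score position
    prior              -- prior density values at those points
    event_for_position -- maps a score position to the event sounding there
    obs_dist           -- obs_dist[event][observed] = P(observed | event)
    observed           -- the detected feature (e.g. a reported pitch)
    """
    dx = grid[1] - grid[0]
    # posterior is proportional to prior times observation likelihood
    posterior = [p * obs_dist[event_for_position(x)][observed]
                 for x, p in zip(grid, prior)]
    # renormalize so the area under the density is again 1
    total = sum(posterior) * dx
    return [p / total for p in posterior]
```

As a usage example, with two events of equal length and a uniform prior, observing a pitch that is far more likely under the first event's distribution shifts most of the probability mass into the first event's region of the score.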