Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding

Most pooling methods in state-of-the-art speaker embedding networks are implemented in the temporal domain. However, because the feature maps produced by the last frame-level layer are highly non-stationary, it is not advantageous to use the global statistics (e.g., means and standard deviations) of the temporal feature maps as aggregated embeddings. This motivates us to explore stationary spectral representations and perform aggregation in the spectral domain. In this paper, we propose attentive short-time spectral pooling (attentive STSP) from a Fourier perspective to exploit the local stationarity of the feature maps. In attentive STSP, for each utterance, we compute the spectral representations as an attention-weighted average of the windowed segments within each spectrogram and aggregate their lowest spectral components to form the speaker embedding. Because most of the energy of the feature maps is concentrated in the low-frequency region of the spectral domain, attentive STSP facilitates information aggregation by retaining only the low spectral components. Moreover, owing to the segment-level attention mechanism, attentive STSP produces smoother attention weights (weights with less variation) than attentive pooling and generalizes better to unseen data, making it more robust against the adverse effect of the non-stationarity in the feature maps. Attentive STSP is shown to consistently outperform attentive pooling on VoxCeleb1, VOiCES19-eval, SRE16-eval, and SRE18-CMN2-eval. This observation suggests that applying segment-level attention and leveraging low spectral components can produce discriminative speaker embeddings.
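
The pooling step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `attentive_stsp_sketch`, the Hann window, the window length, the hop size, the use of magnitude spectra, and the number of retained components `n_low` are all assumptions made for illustration. In the paper the attention weights come from a trained network; here the per-segment scores are simply passed in as an input.

```python
import numpy as np

def attentive_stsp_sketch(feats, scores, win_len=8, hop=4, n_low=2):
    """Illustrative segment-level attentive spectral pooling (hypothetical helper).

    feats  : (C, T) frame-level feature map (C channels, T frames).
    scores : (N,) unnormalized attention scores, one per windowed segment.
    Returns a vector of length C * n_low.
    """
    n_channels, n_frames = feats.shape
    starts = np.arange(0, n_frames - win_len + 1, hop)
    assert scores.shape == (len(starts),), "one score per segment is expected"

    # Cut each channel into overlapping windowed segments: (C, N, win_len).
    window = np.hanning(win_len)  # assumed window; other choices are possible
    segments = np.stack([feats[:, s:s + win_len] * window for s in starts], axis=1)

    # Short-time spectra along the time axis: (C, N, win_len // 2 + 1).
    # Using magnitudes is an assumption of this sketch.
    spec = np.abs(np.fft.rfft(segments, axis=-1))

    # Softmax over segments gives the segment-level attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Attention-weighted average of the segment spectra: (C, F).
    pooled = np.einsum("n,cnf->cf", weights, spec)

    # Keep only the lowest spectral components and flatten them.
    return pooled[:, :n_low].reshape(-1)
```

With a `(C, T)` feature map, the result has `C * n_low` entries, so the output size is fixed regardless of utterance length. Because each weight is shared by all frames inside a segment, the weights vary at the segment rate rather than the frame rate, which is one way to read the abstract's remark about smoother attention weights.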