2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Paper Detail

Paper ID	SPE-37.6
Paper Title	SHORT-TIME SPECTRAL AGGREGATION FOR SPEAKER EMBEDDING
Authors	Youzhi Tu, Man-Wai Mak, The Hong Kong Polytechnic University, Hong Kong SAR China
Session	SPE-37: Speaker Recognition 5: Neural Embedding
Location	Gather.Town
Session Time	Thursday, 10 June, 14:00 - 14:45
Presentation Time	Thursday, 10 June, 14:00 - 14:45
Presentation	Poster
Topic	Speech Processing: [SPE-SPKR] Speaker Recognition and Characterization
Abstract	State-of-the-art speaker verification systems take frame-level acoustic features as input and produce fixed-dimensional embeddings as utterance-level representations. How information is aggregated from the frame-level features is therefore vital for achieving high performance. This paper introduces short-time spectral pooling (STSP) for better aggregation of frame-level information. STSP transforms the temporal feature maps of a speaker embedding network into the spectral domain and extracts the lowest spectral components of the averaged spectrograms for aggregation. Benefiting from the low-pass characteristic of the averaged spectrograms, STSP is able to preserve most of the speaker information in the feature maps using only a few spectral components. We show that statistics pooling is a special case of STSP in which only the DC spectral components are used. Experiments on VoxCeleb1 and VOiCES 2019 show that STSP outperforms statistics pooling and multi-head attentive pooling, which suggests that leveraging more spectral information in the CNN feature maps can produce highly discriminative speaker embeddings.
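
The aggregation idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `stsp_sketch`, the rectangular window, the window length, the hop size, and the use of magnitude spectra are all assumptions made for illustration, since the abstract does not specify them. The sketch splits each channel of a feature map into short segments, takes the magnitude spectrum of each segment, averages the spectra over segments, and keeps the lowest few components per channel.

```python
import numpy as np


def stsp_sketch(feature_map, win_len=8, hop=8, num_components=2):
    """Illustrative short-time spectral pooling (assumed details, see lead-in).

    feature_map: array of shape (channels, frames).
    Returns a vector of shape (channels * num_components,).
    """
    channels, frames = feature_map.shape
    if frames < win_len:
        raise ValueError("feature map is shorter than one window")

    # Cut every channel into short segments (rectangular window, assumed).
    starts = range(0, frames - win_len + 1, hop)
    segments = np.stack(
        [feature_map[:, s:s + win_len] for s in starts], axis=1
    )  # shape: (channels, num_segments, win_len)

    # Magnitude spectrogram of each channel, then average over segments.
    spectrogram = np.abs(np.fft.rfft(segments, axis=-1))
    averaged = spectrogram.mean(axis=1)  # (channels, win_len // 2 + 1)

    # Keep only the lowest spectral components of each channel.
    return averaged[:, :num_components].reshape(-1)


# Hypothetical non-negative feature map (e.g. after a ReLU): 4 channels, 64 frames.
rng = np.random.default_rng(0)
fmap = np.abs(rng.standard_normal((4, 64)))

embedding = stsp_sketch(fmap, win_len=8, hop=8, num_components=2)

# With one component (DC only), non-overlapping windows and non-negative
# inputs, the pooled value divided by the window length is the temporal mean.
dc_only = stsp_sketch(fmap, win_len=8, hop=8, num_components=1)
print(np.allclose(dc_only / 8, fmap.mean(axis=1)))
```

Under these assumptions, the DC-only case reduces to the channel-wise mean, which is consistent with the abstract's statement that statistics pooling is a special case of STSP. The sketch checks only the mean; how the standard deviation is recovered is not stated in the abstract and is not modeled here.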