2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper IDSPE-38.3
Paper Title Self-supervised text-independent speaker verification using prototypical momentum contrastive learning
Authors Wei Xia, University of Texas at Dallas, United States; Chunlei Zhang, Chao Weng, Meng Yu, Dong Yu, Tencent AI Lab, United States
SessionSPE-38: Speaker Recognition 6: Self-supervised and Unsupervised Learning
LocationGather.Town
Session Time:Thursday, 10 June, 14:00 - 14:45
Presentation Time:Thursday, 10 June, 14:00 - 14:45
Presentation Poster
Topic Speech Processing: [SPE-SPKR] Speaker Recognition and Characterization
IEEE Xplore Open Preview  Click here to view in IEEE Xplore
Virtual Presentation  Click here to watch in the Virtual Conference
Abstract In this study, we investigate self-supervised representation learning for speaker verification (SV). First, we examine a simple contrastive learning approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where the MoCo speaker embedding system utilizes a queue to maintain a large set of negative examples. We show that better speaker embeddings can be learned by momentum contrastive learning. Next, alternative augmentation strategies are explored to normalize extrinsic speaker variabilities of two random segments from the same speech utterance. Specifically, augmentation in the waveform largely improves the speaker representations for SV tasks. The proposed MoCo speaker embedding is further improved when a prototypical memory bank is introduced, which encourages the speaker embeddings to be closer to their assigned prototypes with an intermediate clustering step. In addition, we generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled. Comprehensive experiments on the Voxceleb dataset demonstrate that our proposed self-supervised approach achieves competitive performance compared with existing techniques, and can approach fully supervised results with partially labeled data.