Paper ID | SPE-38.1
Paper Title | Contrastive Self-supervised Learning for Text-independent Speaker Verification
Authors | Haoran Zhang, Yuexian Zou, Helin Wang, Peking University, China
Session | SPE-38: Speaker Recognition 6: Self-supervised and Unsupervised Learning
Location | Gather.Town
Session Time | Thursday, 10 June, 14:00 - 14:45
Presentation Time | Thursday, 10 June, 14:00 - 14:45
Presentation | Poster
Topic | Speech Processing: [SPE-SPKR] Speaker Recognition and Characterization
Abstract | Current speaker verification models rely on supervised training with massive amounts of manually annotated data, but collecting labeled utterances from multiple speakers is expensive and raises privacy issues. To open up the opportunity to exploit massive unlabeled utterance data, our work explores a contrastive self-supervised learning (CSSL) approach for the text-independent speaker verification task. The core principle of CSSL lies in minimizing the distance between the embeddings of augmented segments truncated from the same utterance while maximizing the distance between embeddings of segments from different utterances. We further propose a channel-invariant loss to prevent the network from encoding undesired channel information into the speaker representation. Bearing these in mind, we conduct extensive experiments on the VoxCeleb1&2 datasets. A self-supervised thin-ResNet34 fine-tuned with only 5% of the labeled data achieves performance comparable to the fully supervised model, substantially reducing the amount of manual annotation required.
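For readers who want a concrete picture of the contrastive objective described in the abstract, below is a minimal PyTorch-style sketch of an NT-Xent-style loss over two augmented segments per utterance, together with one possible form of the channel-invariant term. The function names, the mean-squared-error formulation of the channel-invariant loss, and the temperature value are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two augmented segments truncated
    # from the same batch of utterances; the i-th rows form a positive pair.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    # Cosine-similarity matrix between every segment pair in the batch.
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    # Minimize distance for segments from the same utterance (diagonal)
    # and maximize it for segments from different utterances (off-diagonal).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def channel_invariant_loss(z_a, z_b):
    # Hypothetical form only: penalize the distance between embeddings of
    # the same segment under two different channel augmentations, so the
    # encoder is discouraged from keeping channel information.
    return F.mse_loss(F.normalize(z_a, dim=1), F.normalize(z_b, dim=1))

In a training loop the two terms would typically be combined with a weighting factor before backpropagation; that weighting is likewise a design choice the abstract does not specify.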