2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information



Paper Detail

Paper ID: SPE-38.1
Paper Title: Contrastive Self-supervised Learning for Text-independent Speaker Verification
Authors: Haoran Zhang, Yuexian Zou, Helin Wang, Peking University, China
Session: SPE-38: Speaker Recognition 6: Self-supervised and Unsupervised Learning
Location: Gather.Town
Session Time: Thursday, 10 June, 14:00 - 14:45
Presentation Time: Thursday, 10 June, 14:00 - 14:45
Presentation: Poster
Topic: Speech Processing: [SPE-SPKR] Speaker Recognition and Characterization
IEEE Xplore: Open Preview available
Abstract: Current speaker verification models rely on supervised training with massive amounts of manually annotated data, but collecting labeled utterances from multiple speakers is expensive and raises privacy issues. To open up an opportunity to exploit massive unlabeled utterance data, our work applies a contrastive self-supervised learning (CSSL) approach to the text-independent speaker verification task. The core principle of CSSL lies in minimizing the distance between the embeddings of augmented segments truncated from the same utterance while maximizing the distance between embeddings from different utterances. We propose a channel-invariant loss to prevent the network from encoding undesired channel information into the speaker representation. With these in mind, we conduct intensive experiments on the VoxCeleb1&2 datasets. The self-supervised thin-ResNet34, fine-tuned with only 5% of the labeled data, achieves performance comparable to the fully supervised model, which saves a large amount of manual annotation.
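The contrastive principle described in the abstract — pulling together embeddings of two augmented segments from the same utterance and pushing apart embeddings from different utterances — can be sketched with a generic NT-Xent-style loss. This is a minimal NumPy illustration under assumed conventions (batch layout, temperature value); the paper's exact loss, including its channel-invariant term, is not given on this page.

```python
import numpy as np

def contrastive_loss(z1, z2, temperature=0.1):
    """NT-Xent-style contrastive loss sketch.

    z1[i] and z2[i] are embeddings of two augmented segments truncated
    from the same utterance i (a positive pair); all other rows act as
    negatives. Hypothetical illustration, not the paper's exact loss.
    """
    # L2-normalize so dot products become cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)           # (2N, D)
    sim = z @ z.T / temperature                    # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    n = z1.shape[0]
    # the positive partner of row i is row (i + n) mod 2n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # cross-entropy of each row against its positive partner
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(2 * n), pos])
```

Minimizing this loss drives same-utterance pairs toward high cosine similarity and different-utterance pairs toward low similarity, which is the core CSSL objective the abstract states.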