Paper ID | MLSP-11.6 |
Paper Title |
A Comparison of Discrete Latent Variable Models for Speech Representation Learning |
Authors |
Henry Zhou, University of Toronto, Canada; Alexei Baevski, Michael Auli, Facebook AI Research, United States |
Session | MLSP-11: Self-supervised Learning for Speech Processing |
Location | Gather.Town |
Session Time: | Tuesday, 08 June, 16:30 - 17:15 |
Presentation Time: | Tuesday, 08 June, 16:30 - 17:15 |
Presentation |
Poster
|
Topic |
Machine Learning for Signal Processing: [MLR-SSUP] Self-supervised and semi-supervised learning |
IEEE Xplore Open Preview |
Click here to view in IEEE Xplore |
Virtual Presentation |
Click here to watch in the Virtual Conference |
Abstract |
Neural latent variable models enable the discovery of interesting structure in speech audio data. This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal. Our study compares the representations learned by vq-vae and vq-wav2vec in terms of sub-word unit discovery and phoneme recognition performance. Results show that future time-step prediction with vq-wav2vec achieves better performance. The best system achieves an error rate of 13.22 on the ZeroSpeech 2019 ABX phoneme discrimination challenge. |