Paper ID | MLSP-11.6 | ||
Paper Title | A Comparison of Discrete Latent Variable Models for Speech Representation Learning | ||
Authors | Henry Zhou, University of Toronto, Canada; Alexei Baevski, Michael Auli, Facebook AI Research, United States | ||
Session | MLSP-11: Self-supervised Learning for Speech Processing | ||
Location | Gather.Town | ||
Session Time | Tuesday, 08 June, 16:30 - 17:15 | ||
Presentation Time | Tuesday, 08 June, 16:30 - 17:15 | ||
Presentation | Poster | ||
Topic | Machine Learning for Signal Processing: [MLR-SSUP] Self-supervised and semi-supervised learning | ||
Abstract | Neural latent variable models enable the discovery of interesting structure in speech audio data. This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal. Our study compares the representations learned by vq-vae and vq-wav2vec in terms of sub-word unit discovery and phoneme recognition performance. Results show that future time-step prediction with vq-wav2vec achieves better performance. The best system achieves an error rate of 13.22 on the ZeroSpeech 2019 ABX phoneme discrimination challenge. |