Paper ID | SPE-32.1 |
Paper Title | HUBERT: HOW MUCH CAN A BAD TEACHER BENEFIT ASR PRE-TRAINING? |
Authors | Wei-Ning Hsu, Facebook AI Research, United States; Yao-Hung Hubert Tsai, Carnegie Mellon University, United States; Benjamin Bolte, Facebook AI Research, United States; Ruslan Salakhutdinov, Carnegie Mellon University, United States; Abdelrahman Mohamed, Facebook AI Research, United States |
Session | SPE-32: Speech Recognition 12: Self-supervised, Semi-supervised, Unsupervised Training |
Location | Gather.Town |
Session Time | Thursday, 10 June, 13:00 - 13:45 |
Presentation Time | Thursday, 10 June, 13:00 - 13:45 |
Presentation | Poster |
Topic | Speech Processing: [SPE-GASR] General Topics in Speech Recognition |
Abstract | Compared to vision and language applications, self-supervised pre-training approaches for ASR are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) with audio-only pre-training, there is no lexicon of sound units, and (3) sound units have variable lengths with no explicit segmentation. In this paper, we propose the Hidden-Unit BERT (HUBERT) model, which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model. A key ingredient of our approach is applying the predictive loss over the masked regions only. This allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HUBERT model matches the state-of-the-art wav2vec 2.0 performance on the ultra-low-resource Libri-light 10h, 1h, and 10min supervised subsets. |
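The abstract describes pre-training by predicting k-means cluster labels on masked frames only. The following is a minimal sketch of that idea, not the authors' implementation: the feature dimensions, model sizes, masking rate, and the `MaskedPredictor` module are all illustrative assumptions, with scikit-learn k-means standing in for the "cheap k-means clustering step" that serves as the unsupervised teacher.

```python
# Sketch: k-means teacher labels + masked-frame prediction loss.
# All hyperparameters and module names here are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

num_clusters = 100   # size of the k-means "teacher" codebook (as in the abstract)
feat_dim = 39        # e.g. an MFCC-like frame feature dimension (assumed)
hidden_dim = 256     # assumed encoder width

# 1) Offline teacher: cluster frame-level features to get frame-aligned pseudo-labels.
frames = torch.randn(10000, feat_dim).numpy()          # stand-in for real acoustic frames
teacher = KMeans(n_clusters=num_clusters, n_init=4).fit(frames)

class MaskedPredictor(nn.Module):
    """BERT-style encoder that predicts the cluster id of each masked frame."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden_dim, num_clusters)
        self.mask_emb = nn.Parameter(torch.zeros(feat_dim))  # learned mask embedding

    def forward(self, x, mask):
        # Replace masked frames with the mask embedding before encoding.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.head(self.encoder(self.proj(x)))

# 2) Pre-training step: the predictive loss is applied over the masked frames only.
model = MaskedPredictor()
batch = torch.randn(8, 200, feat_dim)                           # (B, T, F) frame features
labels = torch.as_tensor(
    teacher.predict(batch.reshape(-1, feat_dim).numpy())
).view(8, 200).long()                                           # frame-aligned cluster targets
mask = torch.rand(8, 200) < 0.5                                 # random frame mask (rate assumed)
logits = model(batch, mask)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])  # masked positions only
loss.backward()
```

The "two iterations of clustering" mentioned in the abstract would correspond to re-running the k-means step on features produced by a first-pass pre-trained model and repeating the masked prediction training with the refined labels.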