2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Paper Detail

Paper ID: SPE-32.2
Paper Title: A FURTHER STUDY OF UNSUPERVISED PRETRAINING FOR TRANSFORMER BASED SPEECH RECOGNITION
Authors: Dongwei Jiang, Wubo Li, Ruixiong Zhang, Miao Cao, Ne Luo, Yang Han, Wei Zou, Kun Han, Xiangang Li, Didi Chuxing, China
Session: SPE-32: Speech Recognition 12: Self-supervised, Semi-supervised, Unsupervised Training
Location: Gather.Town
Session Time: Thursday, 10 June, 13:00 - 13:45
Presentation Time: Thursday, 10 June, 13:00 - 13:45
Presentation: Poster
Topic: Speech Processing: [SPE-GASR] General Topics in Speech Recognition
Abstract: The construction of an effective speech recognition system typically requires large amounts of transcribed data, which is expensive to collect. To overcome this problem, many unsupervised pretraining methods have been proposed. Among these methods, Masked Predictive Coding (MPC) achieved significant improvements on various speech recognition datasets with a BERT-like masked reconstruction loss and a transformer backbone. However, many aspects of MPC have yet to be fully investigated. In this paper, we conduct a further study of MPC and focus on three important aspects: the effect of the speaking style of the pretraining data, its extension to streaming models, and strategies for better transferring learned knowledge from the pretraining stage to downstream tasks. The experimental results demonstrate that pretraining data with a matching speaking style is more useful for downstream recognition tasks. A unified training objective combining APC and MPC provided an 8.46% relative error reduction for the streaming model trained on HKUST. Additionally, the combination of target data adaptation and layerwise discriminative training facilitated the knowledge transfer of MPC, realizing a 3.99% relative error reduction on AISHELL over a strong baseline.
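
For readers unfamiliar with the masked-reconstruction objective the abstract refers to, the sketch below illustrates an MPC-style pretraining loss: randomly selected frames of the input acoustic features are masked, a transformer encoder processes the corrupted sequence, and the loss is the reconstruction error on the masked positions only. This is a minimal illustrative sketch in PyTorch, not the authors' implementation; the class name, the 15% mask ratio, the L1 loss, and the model dimensions are assumptions rather than details taken from the paper.

import torch
import torch.nn as nn

class MaskedPredictiveCoding(nn.Module):
    """Minimal MPC-style masked reconstruction objective (illustrative only)."""

    def __init__(self, feat_dim=80, d_model=256, nhead=4, num_layers=6, mask_ratio=0.15):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.output_proj = nn.Linear(d_model, feat_dim)
        self.mask_ratio = mask_ratio

    def forward(self, feats):
        # feats: (batch, time, feat_dim) acoustic features, e.g. log filterbanks
        mask = torch.rand(feats.shape[:2], device=feats.device) < self.mask_ratio
        masked_input = feats.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked frames
        hidden = self.encoder(self.input_proj(masked_input))
        recon = self.output_proj(hidden)
        # reconstruction loss computed only on the masked positions
        return (recon - feats).abs()[mask].mean()

# Usage: one pretraining step on a random batch of features
model = MaskedPredictiveCoding()
feats = torch.randn(4, 200, 80)
loss = model(feats)
loss.backward()

After pretraining with an objective of this kind, the encoder weights would be used to initialize a downstream speech recognition model; the transfer strategies the paper studies (target data adaptation, layerwise discriminative training) concern how that initialization is fine-tuned.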