Paper ID | SPE-1.4 | ||
Paper Title | EFFICIENT KNOWLEDGE DISTILLATION FOR RNN-TRANSDUCER MODELS | ||
Authors | Sankaran Panchapagesan, Daniel Park, Chung-Cheng Chiu, Google, LLC, United States; Yuan Shangguan, Facebook, Inc., United States; Qiao Liang, Alexander Gruenstein, Google, LLC, United States | ||
Session | SPE-1: Speech Recognition 1: Neural Transducer Models 1 | ||
Location | Gather.Town | ||
Session Time | Tuesday, 08 June, 13:00 - 13:45 | ||
Presentation Time | Tuesday, 08 June, 13:00 - 13:45 | ||
Presentation | Poster | ||
Topic | Speech Processing: [SPE-LVCR] Large Vocabulary Continuous Recognition/Search | ||
Abstract | Knowledge distillation is an effective method of transferring knowledge from a large model to a smaller model. Distillation can be viewed as a type of model compression, and has played an important role in on-device ASR applications. In this paper, we develop a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition. Our proposed distillation loss is simple and efficient, and uses only the “y” and “blank” posterior probabilities from the RNN-T output probability lattice. We study the effectiveness of the proposed approach in improving the accuracy of sparse RNN-T models obtained by gradually pruning a larger uncompressed model, which also serves as the teacher during distillation. With distillation of 60% and 90% sparse multi-domain RNN-T models, we obtain WER reductions of 4.3% and 12.1%, respectively, on a noisy FarField eval set. We also present results of experiments on LibriSpeech, where the introduction of the distillation loss yields a 4.8% relative WER reduction on the test-other dataset for a small Conformer model. |
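
The abstract says the distillation loss uses only the "y" and "blank" posteriors at each node of the RNN-T output lattice, but it does not give the exact formula. The sketch below shows one way such a loss could look, under stated assumptions: each full-vocabulary posterior is collapsed into three classes (next label y, blank, and everything else), and the loss is the KL divergence from teacher to student summed over all lattice nodes. The function names, the nested-list lattice layout, and the choice of KL divergence are illustrative assumptions, not the authors' implementation.

```python
import math

def collapse(probs, y_index, blank_index=0):
    """Collapse a full-vocabulary posterior into (y, blank, remainder).

    y_index is None at the last label position, where no next label exists.
    """
    p_blank = probs[blank_index]
    p_y = probs[y_index] if y_index is not None else 0.0
    p_rest = max(1.0 - p_y - p_blank, 0.0)  # all other tokens lumped together
    return (p_y, p_blank, p_rest)

def collapsed_kl(teacher, student, eps=1e-12):
    """KL(teacher || student) over the three collapsed classes."""
    return sum(
        pt * math.log((pt + eps) / (ps + eps))
        for pt, ps in zip(teacher, student)
        if pt > 0.0
    )

def distillation_loss(teacher_lattice, student_lattice, labels, blank_index=0):
    """Sum the collapsed KL over every lattice node (t, u).

    lattice[t][u] is a list of posteriors over the vocabulary (assumed
    layout), with u ranging over 0..len(labels).
    """
    total = 0.0
    for t_row, s_row in zip(teacher_lattice, student_lattice):
        for u, (t_probs, s_probs) in enumerate(zip(t_row, s_row)):
            # The "y" token at position u is the next target label, if any.
            y_index = labels[u] if u < len(labels) else None
            total += collapsed_kl(
                collapse(t_probs, y_index, blank_index),
                collapse(s_probs, y_index, blank_index),
            )
    return total

# Toy lattice: 2 time frames, 1 target label (token 2), vocabulary of 4
# tokens with blank at index 0. All numbers are made up for illustration.
teacher = [
    [[0.1, 0.1, 0.7, 0.1], [0.8, 0.1, 0.05, 0.05]],
    [[0.2, 0.1, 0.6, 0.1], [0.9, 0.05, 0.03, 0.02]],
]
student = [[[0.25] * 4 for _ in range(2)] for _ in range(2)]

loss = distillation_loss(teacher, student, labels=[2])
print(f"collapsed distillation loss: {loss:.4f}")
```

In this sketch the per-node cost is constant regardless of vocabulary size, since only three probabilities are compared at each (t, u). That is one plausible reading of the efficiency the abstract claims.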